Mytherin commented on Aug 21, 2021

This PR reworks the top-n implementation (e.g. `ORDER BY x LIMIT 5`) to use the new sort code instead of the old Value API-based heap code. We stream the input values into the Top-N operator and buffer them. When the buffer exceeds either 5000 tuples or offset + limit * 2, whichever is bigger, we sort it and throw away all values except the top offset + limit. We then extract the very last kept element and use it as a pre-filter for subsequent appends.

For example, suppose we run `ORDER BY x LIMIT 5` on a table where the x values 0, 1, 2, ..., 9999 repeat several times (0, 1, 2, ..., 9999, 0, 1, 2, ..., 9999, ...). We first gather the first 5000 values, sort them, and find that the top 5 values are [0, 1, 2, 3, 4]. We then reject any subsequent values of x that are >= 4, since we know they can no longer change the result.

In the usual case, the pre-filter means we only need to perform very few sorts, and the sorts we do perform are highly optimized, so we get quite good performance here. The approach is also easy to execute in parallel and has a very low memory footprint, since we only need to keep the current top tuples.
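As a rough illustration of the buffering, sort-and-truncate, and pre-filter steps described above, here is a minimal C++ sketch. It is not the actual operator code: `SimpleTopN`, the integer stand-in for rows, and the ascending single-key ordering are all simplifications made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified sketch of the buffered top-N approach: values are buffered until a
// threshold is hit, then sorted and truncated to the top "offset + limit" entries,
// and the last kept value becomes a pre-filter for later appends.
struct SimpleTopN {
	size_t limit;
	size_t offset;
	std::vector<int64_t> buffer;
	bool have_boundary = false;
	int64_t boundary = 0; // last kept value after the previous reduce

	size_t SortThreshold() const {
		// the threshold described above: 5000 tuples or offset + limit * 2
		return std::max<size_t>(5000, offset + limit * 2);
	}

	void Append(int64_t value) {
		// pre-filter: values >= the boundary can no longer change the result
		if (have_boundary && value >= boundary) {
			return;
		}
		buffer.push_back(value);
		if (buffer.size() >= SortThreshold()) {
			Reduce();
		}
	}

	void Reduce() {
		if (buffer.empty()) {
			return;
		}
		// sort and throw away everything below the top "offset + limit"
		std::sort(buffer.begin(), buffer.end());
		if (buffer.size() > offset + limit) {
			buffer.resize(offset + limit);
		}
		boundary = buffer.back();
		have_boundary = true;
	}

	std::vector<int64_t> Finalize() {
		Reduce();
		// skip "offset" rows, return the remaining (at most "limit") rows
		size_t skip = std::min(offset, buffer.size());
		return std::vector<int64_t>(buffer.begin() + skip, buffer.end());
	}
};
```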

The main problem with this approach is the adversarial case in which the input is sorted in exactly the opposite order of the requested top-n, in which case we constantly need to re-sort small chunks of the input. This is particularly problematic with large payloads, since we currently re-shuffle the payload columns on every iteration. See the OrderKey DESC Large Payload query below. This is something we can hopefully address in the future.

Comparison vs Old Version

```sql
-- DATE Large Payload
SELECT * FROM lineitem ORDER BY l_shipdate, l_orderkey LIMIT 5;
-- DATE Small Payload
SELECT l_shipdate, l_orderkey FROM lineitem ORDER BY l_shipdate, l_orderkey LIMIT 5;
-- OrderKey DESC Large Payload
SELECT * FROM lineitem ORDER BY l_orderkey DESC, l_shipdate DESC LIMIT 5;
```
| System | DATE Large Payload | DATE Small Payload | OrderKey DESC Large Payload |
|---|---|---|---|
| DuckDB New Top-N | 0.057s | 0.005s | 0.180s |
| DuckDB Old Top-N | 0.850s | 0.067s | 0.900s |
| DuckDB Order + Limit (full sort of table) | 1.110s | 0.210s | 0.840s |
| SQLite | 2.900s | 1.500s | 2.900s |

Mytherin merged commit 9c57183 into duckdb:master on Aug 26, 2021
Mytherin deleted the topn branch on October 19, 2021
Mytherin added a commit that referenced this pull request Oct 18, 2024
This PR reworks the Top-N implementation to use a heap of sort keys.
Previously, we used to lean on our sort implementation, and would "sort
of" make a heap by re-sorting and discarding entries, in combination
with some early filtering. See
#2172

The main reason we implemented it this way is that we had to implement
the Top-N operator for many types, including nested types, and it was
easier to lean on the existing sort implementation - which was also an
improvement over the `Value`-based implementation we had previously. Now
that we have sort keys, it is much easier to implement the Top-N
algorithm using an actual heap. This PR does exactly that - it
implements the heap over sort keys using the `std` heap utilities
(`std::push_heap` and `std::pop_heap` over a vector).
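
A minimal sketch of such a heap is shown below, assuming sort keys are represented as byte strings whose byte-wise order matches the `ORDER BY` order (plain `std::string` here). The real operator additionally carries payload rows and works on DuckDB vectors; `TopNHeap` is an illustrative name, not the actual class.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimal bounded top-N heap over sort keys, in the spirit described above.
// A sort key compares byte-wise in the same order as the ORDER BY clause,
// so plain std::string comparison suffices in this sketch.
class TopNHeap {
public:
	explicit TopNHeap(size_t capacity) : capacity(capacity) {
	}

	void Insert(std::string sort_key) {
		if (heap.size() < capacity) {
			heap.push_back(std::move(sort_key));
			std::push_heap(heap.begin(), heap.end()); // max-heap: worst key at the front
			return;
		}
		if (sort_key < heap.front()) {
			// the new key beats the current worst entry: replace it
			std::pop_heap(heap.begin(), heap.end());
			heap.back() = std::move(sort_key);
			std::push_heap(heap.begin(), heap.end());
		}
	}

	bool Full() const {
		return heap.size() >= capacity;
	}

	// the worst key that is still part of the top N
	const std::string &Boundary() const {
		return heap.front();
	}

	// final step: turn the heap into an ascending (best-first) result
	std::vector<std::string> SortedResult() {
		std::sort_heap(heap.begin(), heap.end());
		return heap;
	}

private:
	size_t capacity;
	std::vector<std::string> heap;
};
```

Once the heap in this sketch is full, its front element is the worst key that can still be part of the result - exactly the kind of boundary value that the global filtering described below relies on.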

This allows some clean-up, as we can remove specialized code
(`VectorOperations::DistinctLessThanNullsFirst`/`VectorOperations::DistinctGreaterThanNullsFirst`).
In addition, we improve performance in many cases. In particular, sort
keys make it easy to keep track of a "global boundary value" across all
threads, which allows us to do much more skipping in the adversarial
case where the data is reverse-sorted on the order key. This makes
performance much more stable.
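
As a rough sketch of that global boundary idea (the names are illustrative, not DuckDB's actual classes), the shared state can be as simple as a mutex-protected "worst acceptable sort key" that threads publish and consult:

```cpp
#include <mutex>
#include <optional>
#include <string>

// Sketch of the shared "global boundary value": once a thread's local heap is
// full, its worst key is published here, and every thread can then skip rows
// whose sort key cannot beat the published boundary.
struct GlobalTopNState {
	std::mutex lock;
	std::optional<std::string> boundary; // worst sort key that can still enter the result

	// tighten the boundary when a thread finds a smaller "worst" key
	void UpdateBoundary(const std::string &local_worst) {
		std::lock_guard<std::mutex> guard(lock);
		if (!boundary || local_worst < *boundary) {
			boundary = local_worst;
		}
	}

	// rows whose sort key is >= the boundary can be skipped entirely
	bool CanSkip(const std::string &sort_key) {
		std::lock_guard<std::mutex> guard(lock);
		return boundary.has_value() && sort_key >= *boundary;
	}
};
```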

Below are some performance numbers running on TPC-H SF10:


```sql
-- natural sort order, small limit, large payload
SELECT * FROM lineitem ORDER BY l_orderkey LIMIT 5;
-- old: 0.18s, new: 0.22s

-- inverse natural sort order, small limit, large payload
SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;
-- old: 0.76s, new: 0.24s

-- inverse natural sort order, large limit, large payload
SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 10000;
-- old: 1.59s, new: 0.34s

-- natural sort order, small limit, small payload
SELECT l_orderkey FROM lineitem ORDER BY l_orderkey LIMIT 5;
-- old: 0.03s, new: 0.06s

-- inverse natural sort order, small limit, small payload
SELECT l_orderkey FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;
-- old: 0.16s, new: 0.07s

-- inverse natural sort order, small limit, large payload
SELECT l_orderkey FROM lineitem ORDER BY l_orderkey DESC LIMIT 10000;
-- old: 0.32s, new: 0.14s
```

In general, we can see that performance is much more stable and greatly
improved in several cases. There are a number of small regressions - in
particular, when sorting on individual integer keys in natural sort
order, the old algorithm is sometimes faster. That is mostly because in
these cases we can filter out values immediately: the old implementation
could make that decision directly on the sort values, whereas the new
implementation still spends time constructing the sort keys. We could
remedy that in the future by adding templated heaps for primitive types.