Optimize large Top N queries #17141

lnkuiper · 2025-04-16T08:30:17Z

With large ORDER BY ... LIMIT ... queries, at some point it becomes faster to fully sort the data and then apply the limit than to maintain a heap. I think @Tmonster is addressing this in the optimizer in a separate PR.

Nonetheless, I looked at the PhysicalTopN operator and found some things that could be improved. The main one is the merging of heaps, which is inefficient when done by iterating over the entries and calling pop_heap and push_heap for each one. Instead, we are better off just sorting the heap. I also parallelized scanning.
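To illustrate the difference between the two merge strategies, here is a minimal standalone sketch. It is not the code from this PR: `MergeByHeapOps` and `MergeBySort` are hypothetical names, and plain `int` values stand in for the operator's heap entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Merge `other` into the bounded max-heap `heap`, one element at a time.
// `heap` must already be a max-heap holding at most n values.
// Each accepted element costs a pop_heap plus a push_heap.
inline void MergeByHeapOps(std::vector<int> &heap, const std::vector<int> &other, std::size_t n) {
	assert(std::is_heap(heap.begin(), heap.end()));
	if (n == 0) {
		heap.clear();
		return;
	}
	for (int value : other) {
		if (heap.size() < n) {
			heap.push_back(value);
			std::push_heap(heap.begin(), heap.end());
		} else if (value < heap.front()) {
			// Move the current largest to the back, overwrite it, restore the heap.
			std::pop_heap(heap.begin(), heap.end());
			heap.back() = value;
			std::push_heap(heap.begin(), heap.end());
		}
	}
}

// Merge by concatenating both inputs, sorting once, and keeping the n smallest.
// The result is in ascending order, not heap order.
inline void MergeBySort(std::vector<int> &values, const std::vector<int> &other, std::size_t n) {
	values.insert(values.end(), other.begin(), other.end());
	std::sort(values.begin(), values.end());
	if (values.size() > n) {
		values.resize(n);
	}
}
```

Both functions keep the same n smallest values. The second does a single sort over contiguous memory instead of many small heap adjustments, which is the general idea behind "just sorting the heap"; how the operator actually does it may differ.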

I ran this little benchmark to check the performance. Create some data:

create table test25 as select range i from range(25_000_000) order by hash(i);
create table test50 as select range i from range(50_000_000) order by hash(i);

Query:

select any_value(columns(*)) from (from test25 order by i limit 5_000_000);
select any_value(columns(*)) from (from test50 order by i limit 5_000_000);

Current main: ~5.5s and ~7.6s
This PR: ~2.7s and ~3.8s
Without TopN optimization: ~0.25s and ~0.43s

There is still a lot of room for improvement in the PhysicalTopN operator for large inputs. I'm not sure we should spend too much time on it, because the optimizer can simply avoid the TopN and sort the data when N is large, but I figured this change was worth the effort. With this input data, I found that the Top N was only better than sorting for limits below 250k.
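For readers unfamiliar with the trade-off, the two strategies being compared can be sketched as follows. This is a simplified illustration, not DuckDB's implementation: `TopNWithHeap` and `SortThenLimit` are hypothetical names, the input is a plain vector of `int`, and the point at which one overtakes the other depends on the data and hardware.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Top N via a bounded max-heap: keeps at most n values in memory and
// does O(log n) work for every input value that enters the heap.
std::vector<int> TopNWithHeap(const std::vector<int> &input, std::size_t n) {
	std::vector<int> heap;
	if (n == 0) {
		return heap;
	}
	heap.reserve(std::min(n, input.size()));
	for (int value : input) {
		if (heap.size() < n) {
			heap.push_back(value);
			std::push_heap(heap.begin(), heap.end());
		} else if (value < heap.front()) {
			// Replace the current largest of the n smallest values seen so far.
			std::pop_heap(heap.begin(), heap.end());
			heap.back() = value;
			std::push_heap(heap.begin(), heap.end());
		}
	}
	// Turn the heap into ascending order for output.
	std::sort_heap(heap.begin(), heap.end());
	return heap;
}

// Full sort followed by a limit: sorts everything, then keeps the first n values.
std::vector<int> SortThenLimit(std::vector<int> input, std::size_t n) {
	std::sort(input.begin(), input.end());
	if (input.size() > n) {
		input.resize(n);
	}
	return input;
}
```

When n is small relative to the input, most values are rejected with a single comparison against the heap top, so the heap wins. As n grows, more values enter the heap and the per-value heap maintenance starts to dominate.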

carlopi · 2025-04-16T09:28:57Z

src/execution/operator/order/physical_top_n.cpp

-		std::pop_heap(heap_copy.begin(), heap_copy.end());
-		state.scan_order[heap_copy.size() - 1] = UnsafeNumericCast<sel_t>(heap_copy.back().index);
-		heap_copy.pop_back();
+


Just for my understanding, should there not be a sort here? I am not sure I see where the sort happens.

lnkuiper · 2025-04-16

The sort happens in Finalize(). I can clarify with a comment.

carlopi · 2025-04-16

All good, thanks
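As a rough illustration of the exchange above, the removed lines filled `scan_order` from the back by repeatedly popping the largest entry off a copy of the heap. The same ascending order can be produced by sorting the entries once. The sketch below is an assumption about the general shape, not the actual `Finalize()` code: `HeapEntry` and `BuildScanOrder` are hypothetical names, and the key is a plain `int`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a heap entry: a sort key plus the row index it refers to.
struct HeapEntry {
	int key;
	std::uint32_t index;
};

// Sort the entries once by key and emit their row indices in ascending key order.
// std::sort is not stable, so entries with equal keys may come out in any order.
std::vector<std::uint32_t> BuildScanOrder(std::vector<HeapEntry> entries) {
	std::sort(entries.begin(), entries.end(),
	          [](const HeapEntry &lhs, const HeapEntry &rhs) { return lhs.key < rhs.key; });
	std::vector<std::uint32_t> scan_order;
	scan_order.reserve(entries.size());
	for (const HeapEntry &entry : entries) {
		scan_order.push_back(entry.index);
	}
	return scan_order;
}
```

With the ordering established once up front, the scan can read the indices front to back without touching the heap again.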

Mytherin · 2025-04-17T05:58:22Z

Thanks!

Optimize large Top N queries (duckdb/duckdb#17141)

lnkuiper added 5 commits April 15, 2025 14:22

sort instead of popping

69ec7e4

reduce lock contention by using std::sort, and parallelize scans

96f5029

Merge branch 'main' into big_top_n

3cced5e

add cast

15de15c

clean up

814b862

carlopi reviewed Apr 16, 2025

duckdb-draftbot marked this pull request as draft April 16, 2025 10:31

more cleanup and clarification

ecf6279

lnkuiper marked this pull request as ready for review April 16, 2025 14:25

Mytherin merged commit b99d7fd into duckdb:main Apr 17, 2025
48 checks passed

lnkuiper mentioned this pull request Apr 17, 2025

Only trigger TopN rewrite relatively small limits compared to the table size. #17140

Merged

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025

vendor: Update vendored sources to duckdb/duckdb@b99d7fd

6fccc0d

Optimize large Top N queries (duckdb/duckdb#17141)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025

vendor: Update vendored sources to duckdb/duckdb@b99d7fd

b55055b

Optimize large Top N queries (duckdb/duckdb#17141)

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025

vendor: Update vendored sources to duckdb/duckdb@b99d7fd

c1c2924

Optimize large Top N queries (duckdb/duckdb#17141)

xuke-hat mentioned this pull request May 27, 2025

Don't bail on TopN optimization if we don't have a cardinality #17654

Merged

lnkuiper deleted the big_top_n branch July 8, 2025 08:25
