Skip to content

Conversation

lnkuiper
Copy link
Contributor

With large ORDER BY ... LIMIT ... queries, at some point, it becomes better to fully sort the data and apply a limit instead of computing a heap. I think @Tmonster is addressing this in a separate PR in the optimizer.

Nonetheless, I looked at the PhysicalTopN operator and found that some things could be improved, mostly the merging of heaps, which is inefficient if you iterate and do pop_heap and push_heap. Instead, we are better off just sorting the heap. I also parallelized scanning.

I ran this little benchmark to check the performance. Create some data:

create table test25 as select range i from range(25_000_000) order by hash(i);
create table test50 as select range i from range(50_000_000) order by hash(i);

Query:

select any_value(columns(*)) from (from test25 order by i limit 5_000_000);
select any_value(columns(*)) from (from test50 order by i limit 5_000_000);

Current main: ~5.5s and ~7.6s
This PR: ~2.7s and ~3.8s
Without TopN optimization: ~0.25s and ~0.43s

There is still a lot of room for improvement in the PhysicalTopN operator for large inputs, but I'm not sure if we should spend too much time on this because we can just use the optimizer to avoid using a TopN and sort the data if N is large, but I figured this was worth the effort. With this input data, I found that the Top N was only better than sorting for limits less than 250k.

std::pop_heap(heap_copy.begin(), heap_copy.end());
state.scan_order[heap_copy.size() - 1] = UnsafeNumericCast<sel_t>(heap_copy.back().index);
heap_copy.pop_back();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding, should there not be a sort here? I am not sure I see where the sort happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sort happens in Finalize(), I can clarify with a comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, thanks

@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 16, 2025 10:31
@lnkuiper lnkuiper marked this pull request as ready for review April 16, 2025 14:25
@Mytherin Mytherin merged commit b99d7fd into duckdb:main Apr 17, 2025
48 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
@lnkuiper lnkuiper deleted the big_top_n branch July 8, 2025 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants