-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Optimize large Top N queries #17141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize large Top N queries #17141
Conversation
std::pop_heap(heap_copy.begin(), heap_copy.end()); | ||
state.scan_order[heap_copy.size() - 1] = UnsafeNumericCast<sel_t>(heap_copy.back().index); | ||
heap_copy.pop_back(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my understanding, should there not be a sort here? I am not sure I see where the sort happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sort happens in Finalize()
, I can clarify with a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good, thanks
Thanks! |
Optimize large Top N queries (duckdb/duckdb#17141)
Optimize large Top N queries (duckdb/duckdb#17141)
Optimize large Top N queries (duckdb/duckdb#17141)
With large
ORDER BY ... LIMIT ...
queries, at some point, it becomes better to fully sort the data and apply a limit instead of computing a heap. I think @Tmonster is addressing this in a separate PR in the optimizer.Nonetheless, I looked at the
PhysicalTopN
operator and found that some things could be improved, mostly the merging of heaps, which is inefficient if you iterate and dopop_heap
andpush_heap
. Instead, we are better off just sorting the heap. I also parallelized scanning.I ran this little benchmark to check the performance. Create some data:
Query:
Current
main
: ~5.5s and ~7.6sThis PR: ~2.7s and ~3.8s
Without TopN optimization: ~0.25s and ~0.43s
There is still a lot of room for improvement in the
PhysicalTopN
operator for large inputs, but I'm not sure if we should spend too much time on this because we can just use the optimizer to avoid using a TopN and sort the data if N is large, but I figured this was worth the effort. With this input data, I found that the Top N was only better than sorting for limits less than 250k.