Top-N: Rework to use heap of sort keys #14424
This looks great! I have no remarks, just an opportunity for future performance improvement:
```cpp
auto sort_key_values = FlatVector::GetData<string_t>(sort_keys_vec);
for (idx_t r = 0; r < input.size(); r++) {
	auto &sort_key = sort_key_values[r];
	if (!boundary_val.empty() && sort_key > global_boundary_val) {
```
We could remedy that by adding templated heaps for primitive types in the future.
Potentially, we could get a lot of the benefit without templating by recognizing when the sort keys have a fixed size and using a fixed-size `memcmp` instead of the regular string comparison. The less-than comparison could look like this: `FastMemcmp(lhs.GetPrefix(), rhs.GetPrefix(), fixed_size) < 0` (if the fixed size is less than the inline length). This should give a pretty good performance improvement for comparisons without having to use a lot of templating.
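As a sketch of this suggestion (names like `FixedSizeLessThan` and `FixedSize` are illustrative, not DuckDB's actual API; in practice this would only apply when the fixed size fits in the inline prefix of a `string_t`), the fixed-size comparator could look like:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical fixed-width sort-key comparator: when every sort key is known
// to occupy exactly FixedSize bytes, a plain memcmp over that constant length
// replaces the general variable-length string comparison.
template <std::size_t FixedSize>
bool FixedSizeLessThan(const char *lhs, const char *rhs) {
	// With a compile-time-constant length, the compiler can specialize the
	// memcmp into a handful of word-sized loads and compares.
	return std::memcmp(lhs, rhs, FixedSize) < 0;
}
```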
Thanks for the suggestion! I gave it a shot here, but the difference in end-to-end performance is negligible, at least on the above queries. Profiling this, it seems the main performance cost is in constructing the sort keys - and not in the comparisons themselves.
I think to get optimal performance for simple keys we'll need to template this, but I'll leave that for a future PR. I think the difference in performance is not large enough to worry about.
#14900) Follow-up from #14424. Fixes #14896.

#### Large Heaps

When using a large `offset`, the heap we generate in the Top-N operator is large. The previous implementation took some shortcuts that made it work poorly with large heaps, resulting in performance degradations compared to the implementation before #14424:

* `TopNHeap::Combine` would scan and re-insert the values in the heaps, instead of directly merging heaps
* `TopNHeap::Sink` would fully scan the heap at every iteration

This PR fixes these issues.

### `AddSmallHeap`/`AddLargeHeap`

The previous heap insertion happened in two stages:

* Insert the sort keys one-by-one into the heap
* Scan the heap to figure out which **final sort keys** still remain inside the heap
* Append the payload for the corresponding final sort keys

The main reason for this two-stage approach is to deal with many conflicts *within the same data chunk*. For example, consider the query:

```sql
SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;
```

Since `lineitem` is sorted by `l_orderkey`, every chunk we stream into the operator will be full of the new highest values. Without this two-stage approach, we would append all rows to the payload data, only to then discard most rows again. Using the two-stage approach, we append at most 5 rows (the heap size) to the payload at each iteration, leading to lower memory usage and fewer calls to `Reduce`.

### `AddLargeHeap`

This PR adds a new, simpler, single-stage insertion where we insert the sort keys and immediately insert the payload data. We switch to this approach for heaps of `>= 100 rows`. By immediately inserting the payload data we don't need to scan the heap during the sink phase, which gives great speed-ups when we are dealing with larger heaps.
### Benchmarks

Below is an example of a Top-N with a large heap that is sped up significantly by this approach:

```sql
select l_orderkey
from lineitem
where l_linestatus = 'O'
order by l_quantity
limit 100 offset 1000000;
```

| v1.1  | main | new   |
|-------|------|-------|
| 0.86s | 6.9s | 0.77s |
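A minimal sketch of merging two bounded Top-N heaps directly, in the spirit of the `TopNHeap::Combine` fix described above (all names and the plain `std::string`/`std::vector` representation are illustrative, not DuckDB's actual implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical direct merge of two thread-local Top-N heaps. Both vectors are
// max-heaps over sort keys; `capacity` corresponds to limit + offset.
void CombineHeaps(std::vector<std::string> &target,
                  std::vector<std::string> &source, std::size_t capacity) {
	for (auto &key : source) {
		if (target.size() < capacity) {
			target.push_back(std::move(key));
			std::push_heap(target.begin(), target.end());
		} else if (key < target.front()) {
			// evict the current worst kept key before inserting the new one
			std::pop_heap(target.begin(), target.end());
			target.back() = std::move(key);
			std::push_heap(target.begin(), target.end());
		}
		// keys >= target.front() are discarded outright
	}
	source.clear();
}
```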
…ion (#15087)

This PR is essentially a partial revert of #14424. It fixes a performance regression that was caused by the Top-N operator doing the global boundary checking **after** constructing the sort key, instead of before. This meant that, even for rows where we could be certain they would not be added to the heap, we would still be constructing sort keys. This PR brings back the old code that allows the boundary values to be checked directly against the input columns. In order to facilitate this, we deserialize the global boundary value (stored as a sort key) back into a `DataChunk` that can be compared against directly.
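As an illustrative sketch of this pre-filtering idea (all names hypothetical, and the boundary simplified to a single ascending integer order column rather than a deserialized `DataChunk`):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: check each row against the global boundary *before*
// constructing its sort key, so that the (relatively expensive) sort-key
// construction only runs for rows that can still enter the heap.
std::vector<std::size_t> FilterCandidates(const std::vector<int64_t> &order_col,
                                          int64_t global_boundary) {
	std::vector<std::size_t> candidates;
	for (std::size_t r = 0; r < order_col.size(); r++) {
		// direct comparison on the column value - no sort key needed yet
		if (order_col[r] < global_boundary) {
			candidates.push_back(r);
		}
	}
	return candidates; // sort keys are constructed only for these rows
}
```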
This PR reworks the Top-N implementation to use a heap of sort keys. Previously, we used to lean on our sort implementation, and would "sort of" make a heap by re-sorting and discarding entries, in combination with some early filtering. See #2172
The main reason we implemented it this way is that we had to implement the Top-N operator for many types, including nested types, and it was easier to lean on the existing sort implementation - which was also an improvement over the `Value`-based implementation we had previously. Now that we have sort keys, it is much easier to implement the Top-N algorithm using an actual heap. This PR does exactly that - and implements a heap of sort keys using the `std` heap utilities (`std::push_heap` and `std::pop_heap` over a vector).

This allows some clean-up of code, as we can remove specialized code (`VectorOperations::DistinctLessThanNullsFirst`/`VectorOperations::DistinctGreaterThanNullsFirst`). In addition, we improve performance in many cases. In particular, sort keys allow us to easily keep track of a "global boundary value" across all threads, which allows us to do much more skipping in the adversarial case where data is reverse-sorted on the order key. This makes performance much more stable.

Below are some performance numbers running on TPC-H SF10:
In general, we can see that performance is much more stable and greatly improved in several cases. There are a number of small regressions - in particular, when sorting on individual integer keys in natural sort order, the old algorithm is sometimes better. That is mostly because in these cases we can filter out values immediately: in the old implementation we would figure this out directly from the sort values, whereas in the new implementation we still spend time constructing the sort keys. We could remedy that by adding templated heaps for primitive types in the future.
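The heap mechanics the description refers to - `std::push_heap` and `std::pop_heap` over a vector of sort keys - can be sketched generically as follows (a simplified standalone example using `std::string` keys, not DuckDB's actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Generic Top-N over sort keys with the std heap primitives: a max-heap of at
// most n keys keeps the n smallest keys seen so far, and draining it with
// std::sort_heap yields them in ascending order.
std::vector<std::string> TopNSorted(std::vector<std::string> keys, std::size_t n) {
	std::vector<std::string> heap;
	for (auto &key : keys) {
		if (heap.size() < n) {
			heap.push_back(std::move(key));
			std::push_heap(heap.begin(), heap.end());
		} else if (key < heap.front()) {
			// the new key beats the current worst kept key: swap them
			std::pop_heap(heap.begin(), heap.end());
			heap.back() = std::move(key);
			std::push_heap(heap.begin(), heap.end());
		}
		// otherwise the key is skipped without touching the heap
	}
	std::sort_heap(heap.begin(), heap.end()); // ascending order
	return heap;
}
```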