Conversation

@Mytherin Mytherin (Collaborator) commented Oct 17, 2024

This PR reworks the Top-N implementation to use a heap of sort keys. Previously, we leaned on our sort implementation and would "sort of" make a heap by re-sorting and discarding entries, in combination with some early filtering. See #2172

The main reason we implemented it this way is that the Top-N operator had to support many types, including nested types, and it was easier to lean on the existing sort implementation - which was itself an improvement over the Value-based implementation we had before that. Now that we have sort keys, it is much easier to implement the Top-N algorithm using an actual heap over those sort keys. This PR does exactly that - it maintains a heap of sort keys using the standard library (`std::push_heap` and `std::pop_heap` over a vector).
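As a rough illustration of the approach (a minimal sketch, not DuckDB's actual `TopNHeap` - the class and method names here are invented): keep the N best sort keys in a max-heap maintained with `std::push_heap`/`std::pop_heap` over a vector, so the current boundary value always sits at the front of the heap:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: keep the N smallest sort keys (ascending Top-N) in a
// max-heap, so the current boundary (the largest key still kept) is at
// heap.front() and can be used to reject rows cheaply.
class BoundedTopN {
public:
	explicit BoundedTopN(size_t limit) : limit(limit) {
	}

	void Insert(const std::string &sort_key) {
		if (heap.size() < limit) {
			heap.push_back(sort_key);
			std::push_heap(heap.begin(), heap.end());
		} else if (sort_key < heap.front()) {
			// the new key beats the boundary: evict the current maximum
			std::pop_heap(heap.begin(), heap.end());
			heap.back() = sort_key;
			std::push_heap(heap.begin(), heap.end());
		}
		// otherwise the row can be skipped entirely
	}

	// largest key currently kept - anything above this cannot enter the heap
	const std::string &Boundary() const {
		return heap.front();
	}

	std::vector<std::string> SortedResult() const {
		std::vector<std::string> result = heap;
		std::sort(result.begin(), result.end());
		return result;
	}

private:
	size_t limit;
	std::vector<std::string> heap; // max-heap via std::push_heap/std::pop_heap
};
```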

This allows some code clean-up, as we can remove specialized code (`VectorOperations::DistinctLessThanNullsFirst`/`VectorOperations::DistinctGreaterThanNullsFirst`). In addition, performance improves in many cases. In particular, sort keys make it easy to keep track of a "global boundary value" across all threads, which allows much more skipping in the adversarial case where the data is reverse-sorted on the order key. This makes performance much more stable.

Below are some performance numbers running on TPC-H SF10:

```sql
-- natural sort order, small limit, large payload
SELECT * FROM lineitem ORDER BY l_orderkey LIMIT 5;
-- old: 0.18s, new: 0.22s

-- inverse natural sort order, small limit, large payload
SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;
-- old: 0.76s, new: 0.24s

-- inverse natural sort order, large limit, large payload
SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 10000;
-- old: 1.59s, new: 0.34s

-- natural sort order, small limit, small payload
SELECT l_orderkey FROM lineitem ORDER BY l_orderkey LIMIT 5;
-- old: 0.03s, new: 0.06s

-- inverse natural sort order, small limit, small payload
SELECT l_orderkey FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;
-- old: 0.16s, new: 0.07s

-- inverse natural sort order, large limit, small payload
SELECT l_orderkey FROM lineitem ORDER BY l_orderkey DESC LIMIT 10000;
-- old: 0.32s, new: 0.14s
```

In general, performance is much more stable and greatly improved in several cases. There are a number of small regressions - in particular, when sorting on individual integer keys in natural sort order, the old algorithm is sometimes better. That is mostly because in these cases we can filter out values immediately: the old implementation could do this filtering directly on the sort values, whereas the new implementation still spends time constructing the sort keys first. We could remedy that by adding templated heaps for primitive types in the future.

@Mytherin Mytherin requested a review from lnkuiper October 17, 2024 20:34
@lnkuiper lnkuiper (Contributor) left a comment
This looks great! I have no remarks, just an opportunity for future performance improvement:

```cpp
auto sort_key_values = FlatVector::GetData<string_t>(sort_keys_vec);
for (idx_t r = 0; r < input.size(); r++) {
	auto &sort_key = sort_key_values[r];
	if (!boundary_val.empty() && sort_key > global_boundary_val) {
```
@lnkuiper lnkuiper (Contributor) commented:

> We could remedy that by adding templated heaps for primitive types in the future.

Potentially, we could get a lot of the benefit without templating by recognizing when the sort keys have a fixed size and using a fixed-size `memcmp` instead of the regular string comparison. The less-than comparison could look like this: `FastMemcmp(lhs.GetPrefix(), rhs.GetPrefix(), fixed_size) < 0` (if the fixed size is less than the inline length). This should give a good performance improvement for comparisons without requiring a lot of templating.
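A stand-alone sketch of this idea, using plain `std::memcmp` (`FastMemcmp` and `string_t::GetPrefix` are DuckDB internals and are not reproduced here): since sort keys are byte-comparable by construction, a fixed-size `memcmp` yields the same order as the general variable-length comparison, without any templating:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative only: compare two sort keys of a known fixed byte length.
// memcmp compares bytes as unsigned char, which matches the order of
// byte-comparable sort keys.
inline bool FixedSizeLessThan(const char *lhs, const char *rhs, std::size_t fixed_size) {
	return std::memcmp(lhs, rhs, fixed_size) < 0;
}
```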

@Mytherin Mytherin (Collaborator, Author) replied:

Thanks for the suggestion! I gave it a shot here, but the difference in end-to-end performance is negligible, at least on the above queries. Profiling this, it seems the main performance cost is in constructing the sort keys - not in the comparisons themselves.

I think to get optimal performance for simple keys we'll need to template this, but I'll leave that for a future PR. I think the difference in performance is not large enough to worry about.

@Mytherin Mytherin merged commit 1979504 into duckdb:feature on Oct 18, 2024 (40 checks passed).
Mytherin added a commit that referenced this pull request Nov 19, 2024
#14900)

Follow-up from #14424

Fixes #14896

### Large Heaps

When using a large `offset`, the heap we generate in the Top-N operator is large. The previous implementation took some shortcuts that made it work poorly with large heaps, resulting in performance degradations compared to the sort-based implementation it replaced:

* `TopNHeap::Combine` would scan and re-insert the values in the heaps, instead of directly merging the heaps
* `TopNHeap::Sink` would fully scan the heap at every iteration

This PR fixes these issues.

### `AddSmallHeap`

The previous heap insertion happened in two stages:

* Insert the sort keys one-by-one into the heap
* Scan the heap to figure out which **final sort keys** still remain inside the heap, and append the payload for the corresponding final sort keys

The main reason for this two-stage approach is to deal with many
conflicts *within the same data chunk*. For example, consider the query:

```sql
SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;
```

Since `lineitem` is sorted by `l_orderkey`, every chunk we stream into the operator will be full of new highest values. Without the two-stage approach, we would append all rows to the payload data, only to then discard most of them again. With the two-stage approach we append at most 5 rows (the heap size) to the payload at each iteration, leading to lower memory usage and fewer calls to `Reduce`.
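The two-stage approach described above can be sketched as follows (a hypothetical stand-alone version, not DuckDB's actual `TopNHeap::Sink` - the names are invented, and evicted payload rows are not compacted here, which in DuckDB is what `Reduce` is for):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of two-stage insertion. Heap entries are
// (sort key, payload row id); rows of the current chunk get a provisional
// id, and their payload is only appended if the entry survives the chunk.
struct TwoStageTopN {
	using Entry = std::pair<std::string, std::size_t>;

	explicit TwoStageTopN(std::size_t limit) : limit(limit) {
	}

	void Sink(const std::vector<std::string> &keys, const std::vector<std::string> &rows) {
		std::size_t base = payload.size(); // ids >= base belong to this chunk
		// stage 1: run all sort keys of the chunk through the heap
		for (std::size_t r = 0; r < keys.size(); r++) {
			Entry entry {keys[r], base + r};
			if (heap.size() < limit) {
				heap.push_back(std::move(entry));
				std::push_heap(heap.begin(), heap.end(), EntryLess);
			} else if (entry.first < heap.front().first) {
				// beats the boundary: evict the current worst entry
				std::pop_heap(heap.begin(), heap.end(), EntryLess);
				heap.back() = std::move(entry);
				std::push_heap(heap.begin(), heap.end(), EntryLess);
			}
		}
		// stage 2: append payload only for this chunk's surviving rows
		for (auto &entry : heap) {
			if (entry.second >= base) {
				std::size_t chunk_row = entry.second - base;
				entry.second = payload.size();
				payload.push_back(rows[chunk_row]);
			}
		}
	}

	std::vector<std::string> GetResult() {
		std::sort(heap.begin(), heap.end(), EntryLess);
		std::vector<std::string> result;
		for (auto &entry : heap) {
			result.push_back(payload[entry.second]);
		}
		return result;
	}

	static bool EntryLess(const Entry &a, const Entry &b) {
		return a.first < b.first;
	}

	std::size_t limit;
	std::vector<Entry> heap; // max-heap on the sort key
	std::vector<std::string> payload;
};
```

Per chunk, stage 2 appends at most `limit` payload rows, no matter how many rows of the chunk temporarily entered the heap.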

### `AddLargeHeap`

This PR adds a new, simpler, single-stage insertion where we insert the sort keys and immediately append the payload data. We switch to this approach for heaps of `>= 100 rows`. By immediately appending the payload data we don't need to scan the heap during the sink phase, which yields large speed-ups when dealing with larger heaps.

### Benchmarks

Below is an example of a Top-N with a large heap that is sped up
significantly by this approach:

```sql
select l_orderkey
from lineitem
where l_linestatus = 'O'
order by l_quantity limit 100 offset 1000000;
```

| v1.1  | main |  new  |
|-------|------|-------|
| 0.86s | 6.9s | 0.77s |
Mytherin added a commit that referenced this pull request Dec 2, 2024
…ion (#15087)

This PR is essentially a partial revert of
#14424 - it fixes a performance
regression caused by the Top-N operator doing the global boundary check
**after** constructing the sort key, instead of before. This meant that we
would construct sort keys even for rows that we could be certain would
never be added to the heap. This PR brings back the old code that allows
the boundary values to be checked directly against the input values. To
facilitate this, we deserialize the global boundary value (stored as a
sort key) back into a `DataChunk` that can be compared against directly.
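The shape of the fix can be sketched as follows (illustrative only - `SerializeSortKey` and `FilterThenSerialize` are invented stand-ins for DuckDB's sort key encoding and sink filtering, shown here for a single ascending non-NULL `INT32` key): rows are compared against the boundary before any sort key is built, so rejected rows pay nothing for key construction:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sort key encoding for an ascending int32 key: big-endian
// bytes with the sign bit flipped, so memcmp order matches numeric order.
static std::string SerializeSortKey(int32_t value) {
	uint32_t bits = static_cast<uint32_t>(value) ^ 0x80000000u;
	std::string key(4, '\0');
	for (int i = 0; i < 4; i++) {
		key[i] = static_cast<char>((bits >> (24 - 8 * i)) & 0xFF);
	}
	return key;
}

// Check the boundary on the raw values first; only surviving rows pay for
// sort key construction.
static std::vector<std::string> FilterThenSerialize(const std::vector<int32_t> &values, int32_t boundary) {
	std::vector<std::string> keys;
	for (int32_t v : values) {
		if (v > boundary) {
			continue; // filtered before any sort key is built
		}
		keys.push_back(SerializeSortKey(v));
	}
	return keys;
}
```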
@Mytherin Mytherin deleted the topnpriorityqueue branch December 8, 2024 06:52