Skip to content

Conversation

Mytherin
Copy link
Collaborator

@Mytherin Mytherin commented Dec 2, 2024

When executing a Top-N query, we need to find the top or bottom N values for a given set of conditions. The Top-N operator builds the result through a heap in which we keep track of the top-N values seen so far. For any given heap of N values, we know that any row that has an ordering column value that is larger (or smaller, for DESC ordering) will not be in the result set at that moment.

In the Top-N operator, we already keep track of this value - called the boundary value - and use it to prune out rows that we know are not going to be inserted into the heap early on in the Sink call. While this works in speeding up the Top-N operator itself, we still need to scan the rows from the base storage before finding out these rows will not make it into the final result.

This PR extends the Top-N operator to push the boundary value into the scans as a table filter. This is similar to the dynamic table filters generated through joins (introduced in #12908), but "more dynamic". While the filters generated through joins are generated once (when the HT build is complete), the Top-N filters are updated whenever the boundary value is updated - which can happen for every Sink call.

Implementation

The implementation works through a new table filter - the DynamicFilter. This is a regular TableFilter that holds a shared pointer to a DynamicFilterData - which contains a child table filter together with a lock:

struct DynamicFilterData {
	mutex lock;
	unique_ptr<TableFilter> filter;
	bool initialized = false;
};

This filter is generated in the TopN optimizer. The TopN contains a shared pointer to the DynamicFilterData as well. The TopN operator then updates the underlying filter during the Sink phase whenever the global boundary value is updated to a new value.

Performance

Below is an illustration of the performance gain we can obtain here, running this query over TPC-H SF10:

SELECT * FROM lineitem ORDER BY l_orderkey LIMIT 5;
main New
0.19s 0.02s

Note that this is not always beneficial, e.g. when looking in descending order in this scenario. Since the table is naturally ordered and we scan data in the table's natural order, we are never able to prune any row groups:

SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;
main New
0.23s 0.23s

Limitations

  • When the ORDER BY clause has multiple order conditions, we can only generate the filter for the first order condition (since the value of the remaining ones is unknown).
  • We currently only support NULLS LAST ordering. It is possible to extend to NULLS FIRST, but this is more tricky as we need to take NULL values into account in the generated filters/boundary value.

@Mytherin Mytherin merged commit a987c5a into duckdb:main Dec 3, 2024
42 checks passed
@Mytherin Mytherin deleted the dynamictopnboundary branch December 8, 2024 06:50
Mytherin added a commit that referenced this pull request Dec 8, 2024
Follow-up from #15099, enable this
for `VARCHAR` columns.

Also fix an issue in `DynamicFilter::ToExpression` which was incorrectly
not checking the initialized flag.
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099)
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 28, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 28, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant