-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Dynamically push table filters from Top-N operator #15099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
+309
−48
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…prepare for easier join sharing
Mytherin
added a commit
that referenced
this pull request
Dec 8, 2024
Follow-up from #15099, enable this for `VARCHAR` columns. Also fix an issue in `DynamicFilter::ToExpression` which was incorrectly not checking the initialized flag.
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
Dec 27, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
Dec 27, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099)
krlmlr
added a commit
to duckdb/duckdb-r
that referenced
this pull request
Dec 27, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099)
github-actions bot
pushed a commit
to duckdb/duckdb-r
that referenced
this pull request
Dec 28, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099)
github-actions bot
added a commit
to duckdb/duckdb-r
that referenced
this pull request
Dec 28, 2024
Dynamically push table filters from Top-N operator (duckdb/duckdb#15099) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When executing a Top-N query, we need to find the top or bottom
N
values for a given set of conditions. The Top-N operator builds the result through a heap in which we keep track of the top-N values seen so far. For any given heap ofN
values, we know that any row that has an ordering column value that is larger (or smaller, forDESC
ordering) will not be in the result set at that moment.In the Top-N operator, we already keep track of this value - called the
boundary value
- and use it to prune out rows that we know are not going to be inserted into the heap early on in theSink
call. While this works in speeding up the Top-N operator itself, we still need to scan the rows from the base storage before finding out these rows will not make it into the final result.This PR extends the Top-N operator to push the boundary value into the scans as a table filter. This is similar to the dynamic table filters generated through joins (introduced in #12908), but "more dynamic". While the filters generated through joins are generated once (when the HT build is complete), the Top-N filters are updated whenever the boundary value is updated - which can happen for every
Sink
call.Implementation
The implementation works through a new table filter - the
DynamicFilter
. This is a regularTableFilter
that holds a shared pointer to aDynamicFilterData
- which contains a child table filter together with a lock:This filter is generated in the
TopN
optimizer. TheTopN
contains a shared pointer to theDynamicFilterData
as well. TheTopN
operator then updates the underlying filter during theSink
phase whenever the global boundary value is updated to a new value.Performance
Below is an illustration of the performance gain we can obtain here, running this query over TPC-H SF10:
Note that this is not always beneficial, e.g. when looking in descending order in this scenario. Since the table is naturally ordered and we scan data in the table's natural order, we are never able to prune any row groups:
Limitations
ORDER BY
clause has multiple order conditions, we can only generate the filter for the first order condition (since the value of the remaining ones is unknown).NULLS LAST
ordering. It is possible to extend toNULLS FIRST
, but this is more tricky as we need to takeNULL
values into account in the generated filters/boundary value.