-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Generate In-Clause filters from hash joins #14864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ning the full column
…ng OR filters in the dynamic filter pushdown
After looking at #14864, I realized we aren't handling optional filters in arrow scans very well. We filter them out before transform filter, when they should be filtered out in transform filter. Since InFilters are pushed as children of optional filters, we should not check them in Transform Filter. The switch should be a white-list for what is allowed, and the default case should throw at every other filter type. Also added tests to catch In filter pushdown. Both from `IN` and from joins.
Small question here! My ability to read C++ is still limited... ;-) Does this PR enable OR filter pushdown for non-integer columns like strings and floats? Or is it integers only at the moment? Thanks! This will be huge! |
Only integers, but there's no real reason it can't work for strings |
I think it would be great if the I have included the generic the EXPLAIN ANALYZE results for the same query I was running below. |
They are now generated for strings since #15219 |
Generate In-Clause filters from hash joins (duckdb/duckdb#14864)
Generate In-Clause filters from hash joins (duckdb/duckdb#14864) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Follow-up from #12908
This PR extends the join filters that get generated by hash joins to include an
IN
filter when the hash table is small enough. Rather than generating only amin-max
filter, we generate anIN
filter with all values in the hash table. This can greatly improve performance over themin-max
filter when there are few values that are far apart. For example, if the hash table contains the values1
and100000
, theIN
filter might be much more effective as we can prune many more row groups.The threshold for which we generate an
IN
filter is determined by thedynamic_or_filter_threshold
setting, and defaults to 50 rows. TheIN
filter is pushed as anOPTIONAL_FILTER
, which is currently only evaluated for zone-map pruning. As a result, the performance impact of pushing this filter is minimal, while the performance improvement from extra zone-map pruning can be significant.Benchmark
Below is a benchmark that we run over TPC-H SF10 in which we run a join that is generated through an
IN
clause - we join on themin
andmax
values ofl_orderkey
(meaning the generated min/max filters will not be effective in pruning rows).