Skip to content

Conversation

Mytherin
Copy link
Collaborator

Follow-up from #12908

This PR extends the join filters that get generated by hash joins to include an IN filter when the hash table is small enough. Rather than generating only a min-max filter, we generate an IN filter with all values in the hash table. This can greatly improve performance over the min-max filter when there are few values that are far apart. For example, if the hash table contains the values 1 and 100000, the IN filter might be much more effective as we can prune many more row groups.

The threshold for which we generate an IN filter is determined by the dynamic_or_filter_threshold setting, and defaults to 50 rows. The IN filter is pushed as an OPTIONAL_FILTER, which is currently only evaluated for zone-map pruning. As a result, the performance impact of pushing this filter is minimal, while the performance improvement from extra zone-map pruning can be significant.

Benchmark

Below is a benchmark that we run over TPC-H SF10 in which we run a join that is generated through an IN clause - we join on the min and max values of l_orderkey (meaning the generated min/max filters will not be effective in pruning rows).

SELECT *
FROM lineitem
WHERE l_orderkey IN (
    SELECT UNNEST([MIN(l_orderkey), MAX(l_orderkey)])
    FROM lineitem)
ORDER BY ALL;
v1.1 New
0.22s 0.04s

@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 18, 2024 08:37
@Mytherin Mytherin marked this pull request as ready for review November 18, 2024 08:41
@Mytherin Mytherin merged commit b470dea into duckdb:main Nov 18, 2024
42 checks passed
Mytherin added a commit that referenced this pull request Nov 19, 2024
After looking at #14864, I realized
we aren't handling optional filters in arrow scans very well. We filter
them out before transform filter, when they should be filtered out in
transform filter. Since InFilters are pushed as children of optional
filters, we should not check them in Transform Filter. The switch should
be a white-list for what is allowed, and the default case should throw
at every other filter type.

Also added tests to catch In filter pushdown. Both from `IN` and from
joins.
@Alex-Monahan
Copy link
Contributor

Small question here! My ability to read C++ is still limited... ;-)

Does this PR enable OR filter pushdown for non-integer columns like strings and floats? Or is it integers only at the moment?

Thanks! This will be huge!

@Mytherin
Copy link
Collaborator Author

Mytherin commented Dec 4, 2024

Only integers, but there's no real reason it can't work for strings

@zheniasigayev
Copy link

I think it would be great if the IN operator would work for string types used in IN filters. I saw the comment above and the PRs #14313, and #14864 which indicate "This PR will create a zone map table filter when a OR/IN filter on an integer column is present above a table scan."

I have included the generic the EXPLAIN ANALYZE results for the same query I was running below.

image

@Mytherin Mytherin deleted the joinpushorfilter branch December 8, 2024 06:51
@Mytherin
Copy link
Collaborator Author

They are now generated for strings since #15219

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 24, 2024
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 24, 2024
Generate In-Clause filters from hash joins (duckdb/duckdb#14864)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants