Skip to content

Conversation

lnkuiper
Copy link
Contributor

@lnkuiper lnkuiper commented Dec 9, 2024

This was only implemented for integral types, so this PR adds support for strings (also works for dynamic join filter pushdown!).

To prevent regressions in IMDB, I had to add a special case for OPTIONAL_FILTERs in row_group.cpp so that these are skipped for scanning vectors (they are only used for row group checking anyway).

@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 9, 2024 13:39
@lnkuiper lnkuiper marked this pull request as ready for review December 9, 2024 13:39
@Mytherin
Copy link
Collaborator

Mytherin commented Dec 9, 2024

Thanks for the PR! Great change.

Could we add a benchmark that showcases the performance improvement?

I don't see any changes related to the dynamic filter pushdown - did that already work previously or am I missing something?

@lnkuiper
Copy link
Contributor Author

lnkuiper commented Dec 9, 2024

@Mytherin Thanks for the review! I've added a few benchmark files that test the behavior:

(my-venv) [laurens@mac] ~/git/duckdb/duckdb (or_varchar*)
$ ./build/release/benchmark/benchmark_runner benchmark/micro/pushdown/or_varchar_pushdown.benchmark
name	run	timing
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	1	0.149487
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	2	0.152318
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	3	0.150304
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	4	0.154057
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	5	0.156211
(my-venv) [laurens@mac] ~/git/duckdb/duckdb (or_varchar*)
$ ./build/release/benchmark/benchmark_runner benchmark/micro/pushdown/in_varchar_pushdown.benchmark 
name	run	timing
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	1	0.149813
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	2	0.150118
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	3	0.151343
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	4	0.151153
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	5	0.152979
(my-venv) [laurens@mac] ~/git/duckdb/duckdb (or_varchar*)
$ ./build/release/benchmark/benchmark_runner benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark 
name	run	timing
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	1	0.260337
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	2	0.260146
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	3	0.259904
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	4	0.263196
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	5	0.261264
(my-venv) [laurens@mac] ~/git/duckdb/duckdb (or_varchar*)
$ ./build/reldebug/benchmark/benchmark_runner benchmark/micro/pushdown/or_varchar_pushdown.benchmark
name	run	timing
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	1	0.001852
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	2	0.002059
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	3	0.001909
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	4	0.001988
benchmark/micro/pushdown/or_varchar_pushdown.benchmark	5	0.001868
(my-venv) [laurens@mac] ~/git/duckdb/duckdb (or_varchar*)
$ ./build/reldebug/benchmark/benchmark_runner benchmark/micro/pushdown/in_varchar_pushdown.benchmark 
name	run	timing
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	1	0.001907
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	2	0.001908
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	3	0.001958
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	4	0.002131
benchmark/micro/pushdown/in_varchar_pushdown.benchmark	5	0.002057
(my-venv) [laurens@mac] ~/git/duckdb/duckdb (or_varchar*)
$ ./build/reldebug/benchmark/benchmark_runner benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark 
name	run	timing
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	1	0.258381
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	2	0.257439
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	3	0.256559
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	4	0.257359
benchmark/micro/pushdown/join_filter_varchar_pushdown.benchmark	5	0.259957

You are right, the dynamic filter pushdown was working before - this confirms it (it was >2x slower in v1.1.3).

@Mytherin
Copy link
Collaborator

Mytherin commented Dec 9, 2024

Looks great - only the benchmarks don't seem to be added (likely by accident?)

@duckdb-draftbot duckdb-draftbot marked this pull request as draft December 9, 2024 16:54
@lnkuiper lnkuiper marked this pull request as ready for review December 9, 2024 17:51
@arjenpdevries
Copy link
Contributor

Anecdotal evidence from a single query in text retrieval context: SELECT termid, df FROM dict WHERE term IN ( 'radboud', 'university'); has become 3x as fast with this pull request. Before, I had to turn the query into a union by term before getting acceptable timings, now I can just use the query as it should be. Thanks!

@lnkuiper
Copy link
Contributor Author

@Mytherin Forgot to push those yesterday. They are in now

@Mytherin Mytherin merged commit ce58704 into duckdb:main Dec 10, 2024
46 of 49 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

@Alex-Monahan
Copy link
Contributor

Amazing!!!! Thanks folks!!! This will be so helpful for the crazy queries out there...

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 29, 2024
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 29, 2024
`OR`/`IN` filter pushdown for `VARCHAR` (duckdb/duckdb#15219)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
@lnkuiper lnkuiper deleted the or_varchar branch April 14, 2025 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants