Or filter pushdown into zone maps #14313

Tmonster · 2024-10-10T07:18:42Z

This PR creates a zone map table filter when an OR/IN filter on an integer column is present above a table scan.

When are OR/IN filters pushed down?
If the column is an integer type column. Based on some benchmarks, the overhead of checking the min/max for integer columns when the values are unordered or the distinct values are sparse is very small compared to the benefit you get when the values are ordered and/or the distinct values are dense.
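To make the min/max check concrete, here is a minimal sketch of what it boils down to for one row group. The names `ZoneMap` and `RowGroupMayMatch` are hypothetical and only for illustration; this is not the code in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical zone map: the min and max of an integer column within one row group.
struct ZoneMap {
	int64_t min;
	int64_t max;
};

// Returns true if the row group may contain at least one of the IN-list values,
// i.e. the row group cannot be skipped. The cost is two comparisons per value.
bool RowGroupMayMatch(const ZoneMap &zone, const std::vector<int64_t> &in_values) {
	assert(zone.min <= zone.max);
	for (auto value : in_values) {
		if (value >= zone.min && value <= zone.max) {
			return true;
		}
	}
	return false;
}
```

When the data is ordered, most row groups fail this check and are skipped; when it is unordered, most row groups pass and the only cost is the handful of comparisons.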

See the following results. I test OR filter pushdown on the following columns:

A column where every value is repeated 4 times on average (i.e. 25% distinct values), using l_orderkey of lineitem.
A column where every value is distinct, using orders.o_orderkey.

I test on ordered column values and unordered column values.

I select just two values, around the minimum + ~(row group size) and the maximum - ~(row group size). This way at least two row groups are emitted from the scan. I query TPC-H datasets at sf1, sf10, and sf100. Notice that pushing down the OR filter on an unordered version of lineitem is only ~0.1 seconds slower than not pushing down, even at sf100. This comes from the min/max checks. Inspecting the EXPLAIN ANALYZE output by hand shows that on the unordered data the zone map filter lets in every row group at the higher scale factors.
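The effect of ordering on the number of emitted row groups can be reproduced with a small simulation. This is a sketch under assumptions: the helper names are hypothetical, the sizes are toy values (1000 rows, row groups of 100), and the "random" column is a deterministic permutation rather than a real shuffle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts the row groups whose [min, max] range contains at
// least one probe value, i.e. the row groups a zone map check cannot skip.
std::size_t RowGroupsEmitted(const std::vector<int64_t> &column, std::size_t row_group_size,
                             const std::vector<int64_t> &probes) {
	assert(row_group_size > 0);
	std::size_t emitted = 0;
	for (std::size_t start = 0; start < column.size(); start += row_group_size) {
		std::size_t end = std::min(start + row_group_size, column.size());
		auto range = std::minmax_element(column.begin() + static_cast<std::ptrdiff_t>(start),
		                                 column.begin() + static_cast<std::ptrdiff_t>(end));
		for (auto probe : probes) {
			if (probe >= *range.first && probe <= *range.second) {
				emitted++;
				break;
			}
		}
	}
	return emitted;
}

// Ordered column: 0, 1, 2, ..., count - 1.
std::vector<int64_t> OrderedColumn(std::size_t count) {
	std::vector<int64_t> column(count);
	for (std::size_t i = 0; i < count; i++) {
		column[i] = static_cast<int64_t>(i);
	}
	return column;
}

// Deterministic stand-in for a shuffled column: position i holds (i * 37) % count.
// This is a permutation whenever 37 and count share no common factor.
std::vector<int64_t> ScrambledColumn(std::size_t count) {
	std::vector<int64_t> column(count);
	for (std::size_t i = 0; i < count; i++) {
		column[i] = static_cast<int64_t>((i * 37) % count);
	}
	return column;
}
```

With probes at 150 and 850 (roughly minimum + 1.5 row groups and maximum - 1.5 row groups), the ordered column emits two row groups, while the scrambled column emits all ten because every row group spans nearly the full value range.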

Foreign Key Data (lineitem.l_orderkey)

Execution times (in seconds) to find 2 values in the lineitem.l_orderkey column. This benchmark is included in the PR and was performed on a c6id.8xlarge (32 cores, 64 GB memory).
select * from {lineitem_ordered/random_sfXXX} where l_orderkey in (99584, 5900006);

lineitem.l_orderkey	pushdown ordered	pushdown random	no pushdown ordered	no pushdown random
sf=1	0.008176	0.027871	0.026316	0.026233
sf=10	0.008149	0.217365	0.201166	0.201670
sf=100	0.009608	2.041269	1.935285	1.931140

Primary Key Data (orders.o_orderkey)

Execution times (in seconds) to find 2 values in the orders.o_orderkey column. Here every o_orderkey is distinct. However, with the two values selected, all values in the random order case are still propagated through to the FILTER.
select * from {orders_ordered/random_sfXXX} where o_orderkey in (99584, 5900006);

orders.o_orderkey	pushdown ordered	pushdown random	no pushdown ordered	no pushdown random
sf=1	0.011753	0.012705	0.021179	0.021014
sf=10	0.011964	0.085094	0.083017	0.082885
sf=100	0.012428	0.804672	0.792688	0.787496

Future work:

Investigate a better way to push down OR/IN VARCHAR filters into table scans.
Investigate a way to push down OR/IN filters on DATE types.
Could zone map filter pushdown work across multiple columns (a = 5 OR b = 9)?
When Bloom filters are available, evaluate the OR filter on the Bloom filter first.

TODO: Add more tests

Mytherin

Thanks for the PR! Looks great - some comments from my side:

Mytherin · 2024-10-11T10:43:17Z

src/planner/filter/zone_map_filter.cpp

+FilterPropagateResult ZoneMapFilter::CheckStatistics(BaseStatistics &stats) {
+	if (child_filter->filter_type == TableFilterType::CONSTANT_COMPARISON) {
+		auto &const_compare = child_filter->Cast<ConstantFilter>();
+		if (const_compare.constant.type().IsTemporal()) {


Is this required? Can we make this work for all types?

Mytherin · 2024-10-11T10:45:30Z

src/optimizer/filter_combiner.cpp

+						break;
+					}
+					auto const_filter = make_uniq<ConstantFilter>(comp.type, const_val->value);
+					zone_filter = make_uniq<ZoneMapFilter>();


Instead of creating ZoneMapFilter(x) OR ZoneMapFilter(y), can we make the child of a ZoneMapFilter the OR expression - i.e. ZoneMapFilter(x OR y)? That should allow more efficient skipping.
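One way to picture the suggested shape, as a sketch only: a wrapper filter whose single child is the OR expression, with the statistics check delegated to that child. All type and function names below (`PruneResult`, `EqualsFilter`, `OrFilter`, `OptionalWrapper`, `MakeOptionalOr`) are hypothetical stand-ins, not the classes in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical result of checking a filter against row group statistics.
enum class PruneResult { NO_PRUNING_POSSIBLE, ALWAYS_FALSE };

struct Stats {
	int64_t min;
	int64_t max;
};

struct Filter {
	virtual ~Filter() = default;
	virtual PruneResult CheckStats(const Stats &stats) const = 0;
};

// column = constant
struct EqualsFilter : Filter {
	explicit EqualsFilter(int64_t constant_p) : constant(constant_p) {
	}
	PruneResult CheckStats(const Stats &stats) const override {
		assert(stats.min <= stats.max);
		if (constant < stats.min || constant > stats.max) {
			return PruneResult::ALWAYS_FALSE;
		}
		return PruneResult::NO_PRUNING_POSSIBLE;
	}
	int64_t constant;
};

// x OR y OR ...: the row group can be skipped only if every child is always false.
struct OrFilter : Filter {
	PruneResult CheckStats(const Stats &stats) const override {
		for (auto &child : children) {
			if (child->CheckStats(stats) != PruneResult::ALWAYS_FALSE) {
				return PruneResult::NO_PRUNING_POSSIBLE;
			}
		}
		return PruneResult::ALWAYS_FALSE;
	}
	std::vector<std::unique_ptr<Filter>> children;
};

// Wrapper(x OR y): one wrapper around the whole OR, delegating to its child.
struct OptionalWrapper : Filter {
	explicit OptionalWrapper(std::unique_ptr<Filter> child_p) : child(std::move(child_p)) {
	}
	PruneResult CheckStats(const Stats &stats) const override {
		return child->CheckStats(stats);
	}
	std::unique_ptr<Filter> child;
};

// Builds Wrapper(c1 OR c2 OR ...) rather than Wrapper(c1) OR Wrapper(c2) OR ...
std::unique_ptr<Filter> MakeOptionalOr(const std::vector<int64_t> &constants) {
	auto or_filter = std::make_unique<OrFilter>();
	for (auto constant : constants) {
		or_filter->children.push_back(std::make_unique<EqualsFilter>(constant));
	}
	return std::make_unique<OptionalWrapper>(std::move(or_filter));
}
```

In this shape the wrapper never needs to know what kind of filter it holds, so the same wrapper can carry any child filter that supports a statistics check.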

Mytherin · 2024-10-11T10:48:57Z

src/planner/filter/zone_map_filter.cpp

+
+namespace duckdb {
+
+ZoneMapFilter::ZoneMapFilter() : TableFilter(TableFilterType::ZONE_MAP) {


Let's rename the ZoneMapFilter to an OptionalFilter - indicating the true meaning of this type of filter, namely that the table scan is allowed to filter rows based on this criteria, but not required, as opposed to our other table filters that actually enforce that the rows must be removed for correctness.
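The "allowed but not required" semantics can be illustrated with a toy scan. This is a sketch under assumptions: `Scan` and `FilterRows` are hypothetical helpers, and the mandatory filter above the scan is modeled as a plain function. The point is that the final result is the same whether or not the scan honors the optional filter; only the amount of data passed upward changes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using RowGroup = std::vector<int64_t>;

// Toy table scan. If use_optional_filter is set, row groups whose [min, max]
// range contains none of the IN-list values are skipped; otherwise every row
// group is emitted. Either behavior is acceptable for an optional filter.
std::vector<int64_t> Scan(const std::vector<RowGroup> &row_groups, const std::vector<int64_t> &in_values,
                          bool use_optional_filter) {
	assert(!in_values.empty());
	std::vector<int64_t> out;
	for (auto &row_group : row_groups) {
		if (row_group.empty()) {
			continue;
		}
		if (use_optional_filter) {
			auto range = std::minmax_element(row_group.begin(), row_group.end());
			bool may_match = false;
			for (auto value : in_values) {
				if (value >= *range.first && value <= *range.second) {
					may_match = true;
					break;
				}
			}
			if (!may_match) {
				continue;
			}
		}
		out.insert(out.end(), row_group.begin(), row_group.end());
	}
	return out;
}

// The mandatory filter above the scan: this is what guarantees correctness.
std::vector<int64_t> FilterRows(const std::vector<int64_t> &rows, const std::vector<int64_t> &in_values) {
	std::vector<int64_t> out;
	for (auto row : rows) {
		if (std::find(in_values.begin(), in_values.end(), row) != in_values.end()) {
			out.push_back(row);
		}
	}
	return out;
}
```

Because the filter above the scan still runs, a scan that ignores the optional filter entirely remains correct; it just does more work.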

Mytherin · 2024-10-11T10:49:57Z

src/planner/filter/zone_map_filter.cpp

+
+string ZoneMapFilter::ToString(const string &column_name) {
+	D_ASSERT(child_filter->filter_type == TableFilterType::CONSTANT_COMPARISON);
+	auto const_filter = child_filter->Cast<ConstantFilter>();


Can we make this work for all table filters?

Mytherin · 2024-10-11T10:51:46Z

src/storage/table/column_segment.cpp

@@ -398,6 +398,10 @@ idx_t ColumnSegment::FilterSelection(SelectionVector &sel, Vector &vector, Unifi
 		SelectionVector result_sel(approved_tuple_count);
 		auto &conjunction_or = filter.Cast<ConjunctionOrFilter>();
 		for (auto &child_filter : conjunction_or.child_filters) {
+			if (child_filter->filter_type == TableFilterType::ZONE_MAP) {
+				// conjunction OR of zone map is handed in row_group.cpp.
+				return scan_count;


This should not be required here

Mytherin · 2024-10-15T08:36:40Z

Thanks! Looks great.

Tmonster added 30 commits October 9, 2024 09:56

I thought I had it, but I dont think so

df6b904

still tyring to figure this out, not seeing all of hte row groups

b7a52f8

updates

2ddfd75

almost. but need a zone map conjunction filter

4abf2dc

some updates

5f2cadc

add serialization, only add zone map or filters if column id is set

871b73c

fix more failing tests

5d5b5d4

fix more tests, need to also manage cases with hive partitioning

bf40170

fixed last test

b474c99

fix tidy checks

70f0468

tidy fix again

533596b

in filters are also now or filter pushed down

29bce33

make format-fix for large filter pushdown

be674cb

fix parquet or filter pushdown

bede7af

move header

1eec99c

move return true back to original location

328dec2

don't run the filter if the table is not large enough

c6ea2d6

do not convert in () to or when they types are strings

a5b552e

fix some clang tidy

9a84516

added a new comment

f0b2df9

disable in filter pushdown on string types

a81655c

add back in check for statistics

0f52ecf

remove unused enum

0e36dfd

you can pushdown on VARCHARS as well now

cca3ed0

fix filterPropagate result return value and threshold for when to push a zone map down

b9f29c8

tidy check fixes

0f5e046

another tidy fix

ffa1338

add imdb test

ca86937

remove unnecessary test file

5d21810

should fix regression issue

54ed2b9

Mytherin reviewed Oct 11, 2024

Tmonster added 14 commits October 11, 2024 14:29

implement PR comments

210726b

more PR comments

0d5a1af

remove unused code

17c15d7

code clean up

efc4c32

hopefully fix segfault

c20d26c

maybe fix windows build?

19684ea

Merge branch 'feature' into or_filter_pushdown_zone_map_feature

b41f15a

see if this fixes windows

208ffbe

regenerate files

96e01ba

fix serialization

34d916d

fix python build

9f58ae5

require no froce storarge

54eed9d

mmake format-fix

f2fc8be

Merge branch 'feature' into or_filter_pushdown_zone_map_feature

b1bb85f

Tmonster marked this pull request as ready for review October 14, 2024 08:45

remove optional filters from substrait

d478f09

duckdb-draftbot marked this pull request as draft October 14, 2024 11:22

APPLY PATCHES -> APPLY_PATCHES

c50e1f4

Tmonster marked this pull request as ready for review October 14, 2024 13:27

Tmonster added 2 commits October 14, 2024 15:58

fix patch

f2eeacc

remove no_pushdown benchmarks

75edb6d

duckdb-draftbot marked this pull request as draft October 14, 2024 14:00

Tmonster marked this pull request as ready for review October 14, 2024 14:39

Mytherin merged commit 842572c into duckdb:feature Oct 15, 2024
37 of 38 checks passed

This was referenced Nov 18, 2024

When reading parquet, filter using IN () on two values much slower than equals on single value #4295

Closed

can't push down the 'or'/'in' filter #14760

Closed

zheniasigayev mentioned this pull request Dec 6, 2024

Generate In-Clause filters from hash joins #14864

Merged

renovate bot mentioned this pull request Feb 15, 2025

fix(deps): update all-minor-patch elsbrock/hetzner-radar#118

Merged

1 task
