allow external cardinality information (e.g. from iceberg) #14292
…dinality". This ubigint is the exact cardinality of the parquet_scan(), which can span multiple files, and which one may know from external metadata. Like the "schema" named_parameter, it is not really public, but intended for duckdb_iceberg.
I think this is okay to add: it is pretty low profile and will improve iceberg queries quite a bit once iceberg uses it.
However, ideally we still just switch iceberg to the MultiFileReader approach, unifying delta and iceberg, and avoid needing this parameter altogether.
The MultiFileReader approach will be an improvement, but it will only happen in the longer term, so I think this is still useful to merge.
@@ -70,8 +70,8 @@ struct ParquetReadBindData : public TableFunctionData {
// These come from the initial_reader, but need to be stored in case the initial_reader is removed by a filter
idx_t initial_file_cardinality;
idx_t initial_file_row_groups;
idx_t explicit_cardinality; // can be set to inject exterior knowledge on total cardinality (e.g. from a data lake)
Can we explicitly initialize this with 0? It seems this is currently uninitialized.
Done, sorry, I thought the whole struct was initialized to 0.
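For readers following the exchange above, here is a minimal sketch of what explicit initialization could look like. It uses a hypothetical, trimmed-down `BindDataSketch` struct and a stand-in `idx_t` alias rather than the real `ParquetReadBindData`, and it assumes that 0 is meant as "no external cardinality supplied"; the thread does not state that convention outright.

```cpp
#include <cstdint>

// Stand-in for DuckDB's idx_t; an assumption made for this sketch.
using idx_t = std::uint64_t;

// Hypothetical, trimmed-down bind data; not the real ParquetReadBindData.
struct BindDataSketch {
    // Default member initializers guarantee a defined value even when the
    // object is default-initialized (e.g. `BindDataSketch d;`), where plain
    // integer members would otherwise hold indeterminate values.
    idx_t initial_file_cardinality = 0;
    idx_t initial_file_row_groups = 0;
    idx_t explicit_cardinality = 0; // assumed convention: 0 = not provided
};

// Hypothetical helper: reports whether an external cardinality was injected.
bool HasExplicitCardinality(const BindDataSketch &data) {
    return data.explicit_cardinality != 0;
}
```

Without the `= 0` initializers, reading `explicit_cardinality` from a default-initialized object would be undefined behavior, which is presumably what the review comment is guarding against.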
…cz/duckdb into pb/explicit-parquet-cardinality
Thanks!
allow external cardinality information (e.g. from iceberg) (duckdb/duckdb#14292) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Add a new named_parameter for parquet_scan() called "explicit_cardinality".
This ubigint is the exact cardinality of the parquet_scan(), which can span multiple files, and which one may know from external metadata.
Like the "schema" named_parameter, it is not really public, but intended for duckdb_iceberg.
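As a rough illustration of how such a parameter could feed into cardinality estimation, the sketch below lets an externally supplied value take precedence over a per-file estimate. The `ScanInfoSketch` struct, the `EstimateCardinality` function, and the fallback heuristic (first-file row count times file count) are all assumptions made for this example; the actual logic in DuckDB's Parquet reader may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Stand-in for DuckDB's idx_t; an assumption made for this sketch.
using idx_t = std::uint64_t;

// Hypothetical summary of a multi-file Parquet scan.
struct ScanInfoSketch {
    idx_t initial_file_cardinality = 0; // row count of the first file
    idx_t file_count = 0;               // number of files in the scan
    idx_t explicit_cardinality = 0;     // assumed convention: 0 = not provided
};

// Hypothetical estimator: prefer the externally supplied exact cardinality,
// otherwise extrapolate from the first file.
idx_t EstimateCardinality(const ScanInfoSketch &info) {
    if (info.explicit_cardinality != 0) {
        return info.explicit_cardinality;
    }
    // Treat an empty first file as one row so the estimate does not collapse to 0.
    return std::max<idx_t>(info.initial_file_cardinality, 1) * info.file_count;
}
```

The point of the override is that a table format such as Iceberg often already records the exact row count in its metadata, so the optimizer does not need to guess from a single file.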