hannes
Member

@hannes hannes commented Oct 2, 2024

This PR changes the way we handle dictionary-encoded data from Parquet files. Previously, dictionaries were handled in a type-specific way, e.g. as an array of string_t for strings or as a plain blob for integers.

With this PR, we switch to a generic solution: the dictionary is read into a DuckDB Vector using the "plain" reading infrastructure. The offsets into this dictionary are then turned into a selection vector, which is used to Slice the dictionary vector. As a result, Parquet dictionary references become DuckDB dictionary vectors. In addition, we add generic bounds checks for dictionary offsets so that corrupt Parquet files cannot cause crashes.
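As a rough illustration of the idea (not the actual DuckDB code, which is C++), the approach can be sketched in Python. The names `DictionaryVector` and `slice_dictionary` are hypothetical and exist only for this sketch; it assumes the dictionary has already been decoded into a plain sequence of values and that the page offsets are plain integers.

```python
from typing import Any, List, Sequence


class DictionaryVector:
    """Hypothetical stand-in for a dictionary vector:
    a shared dictionary of values plus a selection vector of offsets."""

    def __init__(self, dictionary: Sequence[Any], selection: List[int]) -> None:
        self.dictionary = dictionary  # shared, not copied per row
        self.selection = selection    # one dictionary offset per row

    def __len__(self) -> int:
        return len(self.selection)

    def __getitem__(self, row: int) -> Any:
        # A row is resolved lazily through the selection vector.
        return self.dictionary[self.selection[row]]

    def materialize(self) -> List[Any]:
        # Flatten into a plain list of values (only when really needed).
        return [self.dictionary[offset] for offset in self.selection]


def slice_dictionary(dictionary: Sequence[Any], offsets: Sequence[int]) -> DictionaryVector:
    """Turn dictionary offsets into a selection vector over the dictionary.

    Every offset is bounds-checked first, so a corrupt file yields an
    error instead of an out-of-range read.
    """
    size = len(dictionary)
    for offset in offsets:
        if not 0 <= offset < size:
            raise ValueError(
                f"dictionary offset {offset} out of range for dictionary of size {size}"
            )
    return DictionaryVector(dictionary, list(offsets))
```

For example, `slice_dictionary(["a", "b", "c"], [2, 0, 0, 1])` produces a vector whose rows resolve to `c, a, a, b` without copying the dictionary values, while an offset of `3` into the same dictionary raises an error. The check is independent of the value type, which is what makes it generic.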

This PR also adds a benchmark that shows that handling dictionaries is actually slightly faster with this PR, especially for large dictionaries.

hannes added 5 commits October 1, 2024 13:25
… reader to read dictionary entries, turn dictionary in a vector, and just emit dictionary vectors on top of that as results
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 2, 2024 09:18
@hannes hannes marked this pull request as ready for review October 2, 2024 09:20
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 2, 2024 12:08
@hannes hannes marked this pull request as ready for review October 2, 2024 12:08
@hannes
Member Author

hannes commented Oct 2, 2024

Not sure yet whether this should target feature or main

@hannes hannes changed the base branch from main to feature October 7, 2024 08:24
@hannes hannes merged commit 8e68538 into duckdb:feature Oct 7, 2024
44 checks passed
samansmink added a commit to samansmink/duckdb that referenced this pull request Oct 17, 2024
This reverts commit 8e68538, reversing
changes made to 92cf6f1.
hannes added a commit that referenced this pull request Oct 22, 2024
Fix an issue when reading from partial pages with the new dictionary
handling added in #14194
@hannes hannes deleted the parquetdictoverhaul branch October 29, 2024 15:03