Overhaul Parquet dictionary handling #14194
Merged
This PR changes the way we handle dictionary-encoded data from Parquet files. Previously, we used type-specific handling for dictionaries, e.g., an array of string_t for strings or just a raw blob for integers.
With this PR, we switch to a generic solution: the dictionary is read using the "plain" reading infrastructure into a DuckDB Vector. Later, the offsets into this dictionary are turned into a selection vector, which is used to Slice the dictionary vector. This has the nice effect that Parquet dictionary references become DuckDB dictionary vectors. In addition, we add generic bounds checks for dictionary offsets to ensure that corrupt Parquet files cannot cause crashes.
This PR also adds a benchmark showing that dictionary handling is actually slightly faster with this change, especially for large dictionaries.