Overhaul Parquet dictionary handling #14194
Merged
This PR changes the way we handle dictionary-encoded data from Parquet files. Previously, we used type-specific handling for dictionaries, e.g., an array of string_t for strings or just a raw blob for integers.
With this PR, we switch to a generic solution: the dictionary is read using the "plain" reading infrastructure into a DuckDB Vector. Later, the offsets into this dictionary are turned into a selection vector, which is used to Slice the dictionary vector. This has the nice effect that Parquet dictionary references become DuckDB dictionary vectors. In addition, we add generic bounds checks for dictionary offsets to ensure that corrupt Parquet files cannot cause crashes.
This PR also adds a benchmark showing that dictionary handling is actually slightly faster with this change, especially for large dictionaries.