Parquet Reader: Split `CreateReader` into two separate stages - `ParseSchema` and `CreateReader` #16161

Mytherin · 2025-02-10T16:05:49Z

The current Parquet reader uses the ColumnReader for two purposes:

- Parsing schema metadata (getting the types/names and reading stats)
- Reading the data

This PR splits up these two cases. The new ParquetColumnSchema class fulfills the first purpose. The new ParseSchema method looks at a Parquet schema and converts it into a ParquetColumnSchema, which the ParquetReader uses to obtain the result types of the file.
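As a rough illustration of the schema-parsing stage, the sketch below turns a flat, depth-first list of schema elements (the layout the Parquet footer uses, with a child count per node) into a tree of schema objects. `FlatSchemaElement`, `ColumnSchema` and `ParseSchemaNode` are hypothetical stand-ins invented for this example; they are not the classes or signatures added by this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a Parquet schema element: the footer stores the
// schema as a flat, depth-first list where each node records its child count.
struct FlatSchemaElement {
    std::string name;
    std::string physical_type; // empty for group (struct-like) nodes
    int num_children;
};

// Hypothetical stand-in for ParquetColumnSchema: names and types only, no reader state.
struct ColumnSchema {
    std::string name;
    std::string type;
    std::vector<ColumnSchema> children;
};

// Consume the element at `next` and all of its descendants, advancing `next` past the subtree.
ColumnSchema ParseSchemaNode(const std::vector<FlatSchemaElement> &elements, std::size_t &next) {
    assert(next < elements.size());
    const FlatSchemaElement &element = elements[next++];
    ColumnSchema result;
    result.name = element.name;
    result.type = element.num_children > 0 ? "STRUCT" : element.physical_type;
    for (int i = 0; i < element.num_children; i++) {
        result.children.push_back(ParseSchemaNode(elements, next));
    }
    return result;
}

// Stage one: build the schema tree rooted at the first element of the flat list.
ColumnSchema ParseSchema(const std::vector<FlatSchemaElement> &elements) {
    std::size_t next = 0;
    return ParseSchemaNode(elements, next);
}
```

Nothing in this stage touches column data, so the result types of a file can be derived from the returned tree alone.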

The column readers are now created by the CreateReader method from the ParquetColumnSchema objects. As a result, creating the column readers is significantly simpler, since the Parquet metadata has already been parsed at this stage. We can also avoid creating column readers for columns we are not going to read, which can lead to significant performance improvements when reading only a small subset of the columns from small-ish files.
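A minimal sketch of the second stage, assuming the schema has already been parsed: readers are only constructed for the projected columns, and every other slot stays empty. The `ColumnSchema`, `ColumnReader` and `CreateReaders` names here are hypothetical and simplified for illustration; the actual reader classes carry decoding state that is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an already-parsed top-level column schema.
struct ColumnSchema {
    std::string name;
    std::string type;
};

// Hypothetical stand-in for a column reader; real readers also hold decoding state and buffers.
struct ColumnReader {
    explicit ColumnReader(const ColumnSchema &schema_p) : schema(schema_p) {
    }
    ColumnSchema schema;
};

// Stage two: build readers only for the projected columns.
// Slots for columns that are not read stay nullptr.
std::vector<std::unique_ptr<ColumnReader>> CreateReaders(const std::vector<ColumnSchema> &schemas,
                                                         const std::vector<std::size_t> &projected_columns) {
    std::vector<std::unique_ptr<ColumnReader>> readers(schemas.size());
    for (std::size_t column_index : projected_columns) {
        assert(column_index < schemas.size());
        readers[column_index] = std::make_unique<ColumnReader>(schemas[column_index]);
    }
    return readers;
}
```

With a three-column file and a projection of only the last column, this constructs one reader instead of three, which is where the savings for narrow projections would come from.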


#16194) #16161 added the ability for stats to be cast from `CastColumnReaders`. While this works, we can no longer do bloom filter look-ups through these casts (at least not without additional code to deal with the cast at this layer). Fixes the issue uncovered at #16191
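The guard described in that follow-up can be sketched roughly as below: a bloom filter is built over the values as stored in the file, so the look-up is only attempted when the column is read without a cast. `ColumnReadInfo` and `ShouldProbeBloomFilter` are hypothetical names made up for this example, not the code from #16194.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of how a column is read: the type stored in the file
// and the type the query expects after any cast.
struct ColumnReadInfo {
    std::string file_type;
    std::string result_type;
    bool has_bloom_filter;
};

// Only probe the bloom filter when the file type and result type match,
// i.e. when no cast sits between the stored values and the filter constant.
bool ShouldProbeBloomFilter(const ColumnReadInfo &column) {
    assert(!column.file_type.empty() && !column.result_type.empty());
    const bool requires_cast = column.file_type != column.result_type;
    return column.has_bloom_filter && !requires_cast;
}
```

Skipping the probe is always safe, since a bloom filter is only an optimization that lets row groups be pruned early.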

- Parquet Reader: Split `CreateReader` into two separate stages - `ParseSchema` and `CreateReader` (duckdb/duckdb#16161)
- Removed the last CI job that used the Ubuntu 18 setup (duckdb/duckdb#16155)
- Hopefully fixing ci runs (duckdb/duckdb#16150)
- Add uniq_ptr_cast for interpreted benchmark. (duckdb/duckdb#16151)

… is FIXED_LEN_BYTE_ARRAY (#17723)

Fixes a regression introduced in #16161. Type length may also be set for variable-length byte arrays (in which case it should be ignored).
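The rule stated there can be illustrated with a small sketch: the type length is only honored when the physical type is FIXED_LEN_BYTE_ARRAY, and is ignored otherwise even if the writer set it. `PhysicalType` here is a reduced enum and `FixedStringWidth` is a hypothetical helper for this example, not the function changed in #17723.

```cpp
#include <cassert>
#include <cstdint>

// Reduced subset of Parquet physical types relevant to string columns.
enum class PhysicalType { BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY };

// Hypothetical helper: returns the fixed string width in bytes, or 0 when the
// values are variable-length. A type length set on a BYTE_ARRAY column is ignored.
std::uint32_t FixedStringWidth(PhysicalType type, bool has_type_length, std::int32_t type_length) {
    if (type != PhysicalType::FIXED_LEN_BYTE_ARRAY || !has_type_length) {
        return 0;
    }
    assert(type_length >= 0);
    return static_cast<std::uint32_t>(type_length);
}
```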

Mytherin added 12 commits February 10, 2025 09:28

WIP: split CreateReader into separate methods (one for parsing the parquet Schema, the other to actually create the column readers)

5ae9d06

Compiling again with many WIPs

a9b12e6

Column Reader split partially working

006fd66

Nested types working again

aacd291

Parquet metadata working again

68623a7

Put parquet schema into ParquetColumnData

465c1b2

Fix for schema evolution, add cast stats propagation for numerics

7ad7b2d

ParquetMetadata needs to return the underlying columns, not the top-level columns

081c4cd

Check for duplicate file row number

8524ae8

Correctly differentiate between file index and schema index

705a7da

Rename file_index to column_index

4e2cfd3

Correctly return geometry type from geoparquet

897b20c

duckdb-draftbot marked this pull request as draft February 10, 2025 22:09

Mytherin marked this pull request as ready for review February 10, 2025 22:09

Mytherin merged commit 05e95a9 into duckdb:main Feb 11, 2025
49 checks passed

Mytherin mentioned this pull request Feb 11, 2025

Parquet reader: Avoid applying bloom filters if we are casting columns #16194

Merged

Mytherin deleted the splitcreatereader branch April 2, 2025 09:25

Mytherin mentioned this pull request May 30, 2025

Parquet Reader: only read strings as fixed length strings if the type is FIXED_LEN_BYTE_ARRAY #17723

Merged
