@Mytherin Mytherin commented Feb 10, 2025

The current Parquet reader uses the ColumnReader for two purposes:

  • Parsing schema metadata (getting the types/names and reading stats)
  • Reading the data

This PR separates these two cases. The new ParquetColumnSchema class covers the first purpose: the new ParseSchema method takes a Parquet schema and converts it into a ParquetColumnSchema. The ParquetReader uses this to obtain the result types of the file.

The column readers are now created by the CreateReader method from the ParquetColumnSchema objects. This makes creating the column readers significantly simpler, since the Parquet metadata has already been parsed at this stage. We can also avoid creating column readers for columns we are not going to read, which can significantly improve performance when we read only a small subset of the columns of small-ish files.
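
To illustrate the two stages, here is a minimal, self-contained sketch. It is not the DuckDB code: every type and function name below (`RawSchemaElement`, `ColumnSchemaSketch`, `ColumnReaderSketch`, `ParseSchemaSketch`, `CreateReadersSketch`) is a hypothetical stand-in, and the real ParquetColumnSchema and CreateReader carry far more state. The point is only the shape of the split: the schema is parsed once into plain metadata, and readers are then built only for the projected columns.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one entry of the file's schema metadata.
struct RawSchemaElement {
    std::string name;
    std::string physical_type;
};

// Stage 1 output: pure metadata (name, type, position), no reading state.
struct ColumnSchemaSketch {
    std::string name;
    std::string type;
    std::size_t column_index;
};

// Stage 2 output: the object that would actually read column data.
struct ColumnReaderSketch {
    explicit ColumnReaderSketch(ColumnSchemaSketch schema_p) : schema(std::move(schema_p)) {}
    ColumnSchemaSketch schema;
};

// Stage 1: convert the raw schema into column schemas. No readers are created here,
// so result types can be obtained without paying for reader construction.
std::vector<ColumnSchemaSketch> ParseSchemaSketch(const std::vector<RawSchemaElement> &elements) {
    std::vector<ColumnSchemaSketch> result;
    result.reserve(elements.size());
    for (std::size_t i = 0; i < elements.size(); i++) {
        result.push_back({elements[i].name, elements[i].physical_type, i});
    }
    return result;
}

// Stage 2: create readers from the already-parsed schemas, only for the columns
// listed in the projection. Columns that are never read get no reader at all.
std::vector<std::unique_ptr<ColumnReaderSketch>>
CreateReadersSketch(const std::vector<ColumnSchemaSketch> &schemas, const std::vector<std::size_t> &projection) {
    std::vector<std::unique_ptr<ColumnReaderSketch>> readers;
    readers.reserve(projection.size());
    for (std::size_t column_index : projection) {
        assert(column_index < schemas.size());
        readers.push_back(std::make_unique<ColumnReaderSketch>(schemas[column_index]));
    }
    return readers;
}
```

In this sketch, a query that projects one column out of three would call `ParseSchemaSketch` once and then `CreateReadersSketch(schemas, {1})`, constructing a single reader instead of three.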

@duckdb-draftbot duckdb-draftbot marked this pull request as draft February 10, 2025 22:09
@Mytherin Mytherin marked this pull request as ready for review February 10, 2025 22:09
@Mytherin Mytherin merged commit 05e95a9 into duckdb:main Feb 11, 2025
49 checks passed
Mytherin added a commit that referenced this pull request Feb 12, 2025
#16194)

#16161 added the ability for stats to be cast from `CastColumnReaders`. While this works, we can no longer do bloom filter look-ups through these casts (at least not without additional code to deal with the cast at this layer).

Fixes the issue uncovered at #16191
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Feb 27, 2025
Parquet Reader: Split `CreateReader` into two separate stages - `ParseSchema` and `CreateReader` (duckdb/duckdb#16161)
Removed the last CI job that used the Ubuntu 18 setup (duckdb/duckdb#16155)
Hopefully fixing ci runs (duckdb/duckdb#16150)
Add uniq_ptr_cast for interpreted benchmark. (duckdb/duckdb#16151)
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
@Mytherin Mytherin deleted the splitcreatereader branch April 2, 2025 09:25
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
Mytherin added a commit that referenced this pull request May 30, 2025
… is FIXED_LEN_BYTE_ARRAY (#17723)

Fixes a regression introduced in #16161.

Type length may also be set for variable-length byte arrays (in which case it should be ignored).
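
The rule in that commit message can be sketched as a small helper. This is an illustration under assumptions, not the actual fix: `PhysicalTypeSketch` and `EffectiveTypeLengthSketch` are hypothetical names, and the assertion that a fixed-length column always declares a length is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, reduced set of physical types, just enough for the example.
enum class PhysicalTypeSketch { INT32, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY };

// Return the declared type length only when the physical type is
// FIXED_LEN_BYTE_ARRAY. For any other type, including variable-length
// BYTE_ARRAY, a declared length is ignored.
std::optional<std::uint32_t> EffectiveTypeLengthSketch(PhysicalTypeSketch type,
                                                       std::optional<std::uint32_t> declared_length) {
    if (type != PhysicalTypeSketch::FIXED_LEN_BYTE_ARRAY) {
        return std::nullopt;
    }
    // Assumption of this sketch: fixed-length columns always declare a length.
    assert(declared_length.has_value());
    return declared_length;
}
```

With this helper, a `BYTE_ARRAY` column whose metadata happens to carry a type length of 16 is still treated as variable-length, while a `FIXED_LEN_BYTE_ARRAY` column with the same metadata is read as 16-byte values.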