Conversation

lnkuiper
Contributor

This PR implements the schema parameter for read_parquet, which allows us to read a Parquet file as if it had the supplied schema. Field IDs are required. For example:

COPY (SELECT 42::INTEGER i) TO 'integers.parquet' (FIELD_IDS {i: 0});
SELECT *
FROM read_parquet('integers.parquet', schema=map {
                    0: {name: 'renamed_i', type: 'BIGINT', default_value: NULL},
                    1: {name: 'new_column', type: 'UTINYINT', default_value: 43}
                  });
-- ┌───────────┬────────────┐
-- │ renamed_i │ new_column │
-- │   int64   │   uint8    │
-- ├───────────┼────────────┤
-- │        42 │         43 │
-- └───────────┴────────────┘

Columns are identified by field id and can be added, deleted, reordered, renamed, and cast to a different type.

This parameter cannot be combined with union_by_name=true, and for now it also cannot be combined with Hive partitioning. Nested types are not (yet) supported either.
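The field-id matching described above can be sketched in plain Python. This is a hypothetical illustration of the mapping logic, not DuckDB's implementation: given the columns a file actually contains, keyed by field id, and the requested schema, produce the renamed/defaulted output columns (casting is omitted for brevity).

```python
# Hypothetical sketch of field-id-based schema mapping (not DuckDB's code).
# file_columns: field_id -> list of values read from one Parquet file.
# schema: field_id -> {"name": ..., "default_value": ...}, mirroring the
# read_parquet(schema=...) parameter shown above.

def apply_schema(file_columns, schema):
    num_rows = len(next(iter(file_columns.values()), []))
    result = {}
    for field_id, spec in schema.items():
        if field_id in file_columns:
            # Column exists in the file: emit it under its (possibly new) name.
            result[spec["name"]] = file_columns[field_id]
        else:
            # Column is missing from the file: fill with the default value.
            result[spec["name"]] = [spec["default_value"]] * num_rows
    return result

file_columns = {0: [42]}  # field id 0 holds the file's column "i"
schema = {
    0: {"name": "renamed_i", "default_value": None},
    1: {"name": "new_column", "default_value": 43},
}
print(apply_schema(file_columns, schema))
# {'renamed_i': [42], 'new_column': [43]}
```

Because columns are matched on field id rather than position or name, the same schema map works even when files differ in column order or column names.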

@lnkuiper lnkuiper requested a review from samansmink September 27, 2023 08:12
@samansmink samansmink (Contributor) left a comment

Very cool! I added 2 minor comments, but in general it looks good!

@Mytherin Mytherin changed the base branch from main to feature September 27, 2023 09:09
@github-actions github-actions bot marked this pull request as draft September 27, 2023 10:33
@lnkuiper
Contributor Author

Thanks for the feedback! I've added the tests and disabled auto-detection of Hive partitioning when the schema parameter is used.

@lnkuiper lnkuiper marked this pull request as ready for review September 27, 2023 10:50
@samansmink
Contributor

LGTM!

@samansmink
Contributor

samansmink commented Sep 29, 2023

Whoops, can I still undo that LGTM? I've started experimenting with this feature in the Iceberg extension and ran into the following test failure:

query I
SELECT count(*)
FROM read_parquet('__TEST_DIR__/integers.parquet', schema=map {
                    0: {name: 'renamed_i', type: 'BIGINT', default_value: NULL},
                    1: {name: 'new_column', type: 'UTINYINT', default_value: 43}
                  })
----
2

fails with

Actual result:
INTERNAL Error: Attempted to access index 1 within vector of size 1

Edit: I think the solution is quite simple; we can have a look together @lnkuiper.
Edit 2: I think I have a fix in https://github.com/samansmink/duckdb/tree/parquet-schema-fix, which I branched off this PR.
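For context, the error message looks like the classic constant-vector pitfall: a column whose physical storage holds a single value (size 1) being indexed as if it were a flat vector of length row-count. This is my reading of the symptom, not confirmed against the fix branch; a minimal Python sketch of that class of bug:

```python
# Hypothetical illustration of the bug class suggested by the error
# "Attempted to access index 1 within vector of size 1"; this is not
# DuckDB's actual code or the actual fix.

class ConstantVector:
    """A column whose value is the same for every row, stored once."""
    def __init__(self, value):
        self.data = [value]  # physical size is 1 regardless of row count

    def get_flat(self, row):   # buggy access pattern for a constant vector
        return self.data[row]  # raises IndexError for any row >= 1

    def get(self, row):        # correct access pattern
        return self.data[0]    # constant: every logical row maps to index 0

v = ConstantVector(43)
print(v.get(0), v.get(1))  # 43 43
try:
    v.get_flat(1)
except IndexError:
    print("flat access at row 1 fails: physical size is 1")
```

With one row the two access patterns coincide, which would explain why the failure only shows up once the file contains two or more rows.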

@github-actions github-actions bot marked this pull request as draft October 4, 2023 07:25
@lnkuiper
Contributor Author

lnkuiper commented Oct 4, 2023

Thanks for the feedback @Tishj, and for finding the bug @samansmink! I have updated the PR and pulled Sam's fix into it.

@lnkuiper lnkuiper marked this pull request as ready for review October 4, 2023 07:29
@lnkuiper
Contributor Author

lnkuiper commented Oct 4, 2023

I think this is ready to go!

@Mytherin Mytherin changed the base branch from feature to main October 5, 2023 16:59
@Mytherin Mytherin changed the base branch from main to feature October 5, 2023 16:59
@Mytherin Mytherin merged commit c3aa759 into duckdb:feature Oct 5, 2023
@Mytherin
Collaborator

Mytherin commented Oct 5, 2023

Thanks!

@lnkuiper lnkuiper deleted the parquet_schema branch November 24, 2023 13:36
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 11, 2023
Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema (one of many upstream PRs batched into this duckdb-r merge commit)
Mytherin added a commit that referenced this pull request Jan 15, 2025
…bal column (#15446)

This PR essentially moves the specialized code that was already in the
parquet extension for matching on `field_id`, added by
<#9123>, into the MultiFileReader.

It also makes it possible to map a local (per-file) column name to a
different global name.

To do this, we bundle the type and name into a struct (`MultiFileReaderColumnDefinition`), where we can also bundle additional metadata like default values and the Parquet field_id.
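The bundled column definition described above might look roughly like the following. This is a Python sketch for illustration only; the actual struct is C++ inside DuckDB's MultiFileReader, and the field names and type stand-ins here are assumptions based on the description, not the real declaration.

```python
# Illustrative stand-in for DuckDB's MultiFileReaderColumnDefinition,
# bundling name, type, default value, and Parquet field_id as described.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class MultiFileReaderColumnDefinition:
    name: str                       # global (possibly renamed) column name
    type: str                       # stand-in for duckdb::LogicalType
    default_value: Any = None       # used when a file lacks this column
    field_id: Optional[int] = None  # Parquet field id used for matching

col = MultiFileReaderColumnDefinition("new_column", "UTINYINT", 43, 1)
print(col.field_id, col.name)  # 1 new_column
```

Keeping the default value and field id next to the name and type lets the generic MultiFileReader match local file columns to global columns without Parquet-specific code paths.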