Add `schema` parameter to `read_parquet` #9123
Very cool! I added 2 minor comments, but in general it looks good!
Thanks for the feedback! I've added the tests and disabled auto-detection of Hive partitioning when the `schema` parameter is used.
LGTM!
woops, can I still undo declaring LGTM? I have started experimenting with this feature in the iceberg extension and ran into the following test failure:
fails with
edit: I think the solution is quite simple, we can have a look together @lnkuiper
Thanks for the feedback @Tishj and for finding the bug @samansmink! I have updated the PR and pulled Sam's fix into it.
I think this is ready to go!
Thanks!
Merge pull request duckdb/duckdb#9164 from Mause/feature/jdbc-uuid-param
Merge pull request duckdb/duckdb#9185 from pdet/adbc_07
Merge pull request duckdb/duckdb#9126 from Maxxen/parquet-kv-metadata
Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema
Merge pull request duckdb/duckdb#9086 from lnkuiper/json_inconsistent_structure
Merge pull request duckdb/duckdb#8977 from Tishj/python_readcsv_multi_v2
Merge pull request duckdb/duckdb#9279 from hawkfish/nsdate-cast
Merge pull request duckdb/duckdb#8851 from taniabogatsch/binary_lambdas
Merge pull request duckdb/duckdb#8983 from Maxxen/types/fixedsizelist
Merge pull request duckdb/duckdb#9318 from Maxxen/fix-unused
Merge pull request duckdb/duckdb#9220 from hawkfish/exclude
Merge pull request duckdb/duckdb#9230 from Maxxen/json-plan-serialization
Merge pull request duckdb/duckdb#9011 from Tmonster/add_create_statement_support_to_fuzzer
Merge pull request duckdb/duckdb#9400 from Maxxen/array-fixes
Merge pull request duckdb/duckdb#8741 from Tishj/python_import_cache_upgrade
Merge fixes
Merge pull request duckdb/duckdb#9395 from taniabogatsch/lambda-performance
Merge pull request duckdb/duckdb#9427 from Tishj/python_table_support_replacement_scan
Merge pull request duckdb/duckdb#9516 from carlopi/fixformat
Merge pull request duckdb/duckdb#9485 from Maxxen/fix-parquet-serialization
Merge pull request duckdb/duckdb#9388 from chrisiou/issue217
Merge pull request duckdb/duckdb#9565 from Maxxen/fix-array-vector-sizes
Merge pull request duckdb/duckdb#9583 from carlopi/feature
Merge pull request duckdb/duckdb#8907 from cryoEncryp/new-list-functions
Merge pull request duckdb/duckdb#8642 from Virgiel/capi-streaming-arrow
Merge pull request duckdb/duckdb#8658 from Tishj/pytype_optional
Merge pull request duckdb/duckdb#9040 from Light-City/feature/set_mg
…bal column (#15446)
This PR essentially moves the specialized code that was already in the Parquet extension for matching on `field_id`, added by #9123, into the MultiFileReader. It also makes it possible to map a local (per-file) column name to a different global name. To do this, we bundle the type and name into a struct (`MultiFileReaderColumnDefinition`), which can also carry additional metadata such as default values and the Parquet `field_id`.
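As a rough illustration of the idea described above (not DuckDB's actual C++ code), the sketch below bundles a column's name, type, default value, and field ID into one record and then matches per-file columns to global columns by field ID. The `ColumnDefinition` class, the `map_local_to_global` helper, and the sample column names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ColumnDefinition:
    """Hypothetical stand-in for a bundled column definition."""
    name: str
    type: str
    default_value: Any = None
    field_id: Optional[int] = None


def map_local_to_global(
    global_cols: List[ColumnDefinition],
    local_cols: List[ColumnDefinition],
) -> Dict[str, str]:
    """Return {local column name: global column name}, matched on field_id."""
    # Index the global columns by field ID; columns without an ID cannot match.
    by_id = {c.field_id: c for c in global_cols if c.field_id is not None}
    mapping: Dict[str, str] = {}
    for local in local_cols:
        target = by_id.get(local.field_id)
        if target is not None:
            mapping[local.name] = target.name
    return mapping


# Global (query-level) columns and the columns found in one particular file.
global_cols = [
    ColumnDefinition("customer_id", "BIGINT", field_id=1),
    ColumnDefinition("region", "VARCHAR", default_value="unknown", field_id=2),
]
local_cols = [
    ColumnDefinition("cust_id", "INTEGER", field_id=1),  # same field, older name
    ColumnDefinition("extra", "VARCHAR", field_id=9),    # not in the global schema
]

mapping = map_local_to_global(global_cols, local_cols)
```

Here the file's `cust_id` column maps to the global `customer_id` because both carry field ID 1, while `extra` has no global counterpart and is left out of the mapping.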
This PR implements the `schema` parameter for `read_parquet`, which allows us to read a Parquet file as if it has the supplied schema. Field IDs are required. Columns are identified by field ID and can be added, deleted, reordered, renamed, and cast to a different type.
This parameter cannot be combined with `union_by_name=true`, and for now, it also cannot be combined with Hive partitioning. Nested types are also not (yet) supported.
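To make the field-ID semantics concrete, here is a minimal pure-Python sketch of how a supplied schema could be applied to a file's columns. It is a conceptual model only and does not use or reproduce DuckDB's implementation; `ColumnSpec`, `apply_schema`, and the sample data are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ColumnSpec:
    """Hypothetical description of one output column in the supplied schema."""
    name: str
    cast: Callable[[Any], Any]
    default: Any = None


def apply_schema(
    file_columns: Dict[int, Tuple[str, List[Any]]],
    schema: Dict[int, ColumnSpec],
) -> Dict[str, List[Any]]:
    """Project file columns (keyed by field ID) onto the supplied schema.

    The output follows the order of `schema`. Field IDs missing from the file
    are filled with the default; file columns not in `schema` are dropped.
    """
    num_rows = max((len(values) for _, values in file_columns.values()), default=0)
    out: Dict[str, List[Any]] = {}
    for field_id, spec in schema.items():
        if field_id in file_columns:
            _, values = file_columns[field_id]
            # Rename via spec.name and cast each non-null value.
            out[spec.name] = [None if v is None else spec.cast(v) for v in values]
        else:
            # Added column: every row gets the default value.
            out[spec.name] = [spec.default] * num_rows
    return out


# Columns as stored in the file: field ID -> (name in file, values).
file_columns = {
    1: ("i", [1, 2, 3]),
    2: ("s", ["a", "b", "c"]),
    3: ("dropped", [0.1, 0.2, 0.3]),
}

# Supplied schema: reorders 2 before 1, renames both, casts 1, adds 4, omits 3.
schema = {
    2: ColumnSpec("renamed_s", str),
    1: ColumnSpec("i_as_float", float),
    4: ColumnSpec("new_column", str, default="n/a"),
}

result = apply_schema(file_columns, schema)
```

The names stored in the file play no role in the matching: only the field IDs do, which is why renaming and reordering are possible without touching the file.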