python: Add missing global options to write_csv #10382
Conversation
Thanks for the PR! Looks good to me. Could you merge with master so the CI can re-run? @Tishj, can you also have a look?
Thanks for adding the options. Could you add some tests for them?
The only thing we really need to verify with these tests is that the options are passed correctly to the CSV writing method.
But that will probably mean we need to check for the effects of the options on the resulting CSV.
There are already a couple of tests for the existing options in tools/pythonpkg/tests/fast/api/test_to_csv.py
The hive_partitioning argument also needs to be added to read_csv so that it's fully Python API compatible. We also need some clarification on how the overwrite argument is supposed to work.
For overwrite, per_thread_output, and use_tmp_file
Added some tests. Note that for `partition_by`, reopening is done through
Thanks, LGTM now 👍
- Merge pull request duckdb/duckdb#11461 from lnkuiper/parquet_dict_early_out
- Merge pull request duckdb/duckdb#11137 from Tishj/sqllogic_parser
- Merge pull request duckdb/duckdb#11095 from Tishj/python_struct_child_count_mismatch
- Merge pull request duckdb/duckdb#10382 from jzavala-gonzalez/python-write-csv-options
Thank you!!
This PR adds more options to DuckDB's Python Relational API for `write_parquet`, matching the `COPY TO` options and addressing #8896:

- `partition_by`
- `write_partition_columns`
- `overwrite`
- `per_thread_output`
- `use_tmp_file`
- `append`

I would also like to note that the `overwrite` option added to the `to_csv` function (#10382) technically passes `overwrite_or_ignore` to the underlying engine: https://github.com/duckdb/duckdb/blob/fd5de0607d7ab5bdddad62cc1a0225be72dff967/tools/pythonpkg/src/pyrelation.cpp#L1291-L1296

To match that behavior, I've implemented it the same way here. Changing it to pass `overwrite` and introducing `overwrite_or_ignore` as a separate option would be a [breaking change](https://duckdb.org/docs/sql/statements/copy.html#copy--to-options), so I've avoided doing that.

I've also improved the `test_to_parquet` tests by introducing new tests for the above-mentioned flags, and by parameterizing the Pandas engine (similar to the `test_to_csv` tests, which use both `NumpyPandas` and `ArrowPandas`). This PR also makes the Python stubs for `{to,write}_{csv,parquet}` match, as they are technically aliases.
Addresses one of the points in #8896.
Adds the following parameters to write_csv:
The stubs are also updated to include these parameters.
Relevant docs: https://duckdb.org/docs/sql/statements/copy.html#copy--to-options