python: Add missing global options to write_csv #10382
Conversation
Thanks for the PR! Looks good to me. Could you merge with master so the CI can re-run? @Tishj, can you also have a look?
Thanks for adding the options. Could you add some tests for them?
The only thing we really need to verify with these tests is that the options are passed correctly to the CSV writing method.
But that will probably mean we need to check for the effects of the options on the resulting CSV.
There are already a couple of tests for the existing options in tools/pythonpkg/tests/fast/api/test_to_csv.py
The hive_partitioning argument also needs to be added to read_csv so that it's fully Python API compatible. We also need some clarification on how the overwrite argument is supposed to work.
For overwrite, per_thread_output, and use_tmp_file
Added some tests. Note that for `partition_by`, reopening is done through
Thanks, LGTM now 👍
- Merge pull request duckdb/duckdb#11461 from lnkuiper/parquet_dict_early_out
- Merge pull request duckdb/duckdb#11137 from Tishj/sqllogic_parser
- Merge pull request duckdb/duckdb#11095 from Tishj/python_struct_child_count_mismatch
- Merge pull request duckdb/duckdb#10382 from jzavala-gonzalez/python-write-csv-options
Thank you!!
This PR adds more options to DuckDB's Python Relational API for `write_parquet`, matching the `COPY TO` options and addressing #8896:

- `partition_by`
- `write_partition_columns`
- `overwrite`
- `per_thread_output`
- `use_tmp_file`
- `append`

I would also like to note that the `overwrite` option added to the `to_csv` function (#10382) technically passes `overwrite_or_ignore` to the underlying engine: https://github.com/duckdb/duckdb/blob/fd5de0607d7ab5bdddad62cc1a0225be72dff967/tools/pythonpkg/src/pyrelation.cpp#L1291-L1296

To match that behavior, I've implemented it the same way here. Changing it to pass `overwrite` and introducing `overwrite_or_ignore` as a separate option would be a [breaking change](https://duckdb.org/docs/sql/statements/copy.html#copy--to-options), so I've avoided doing that.

I've also improved the `test_to_parquet` tests by introducing new tests for the above-mentioned flags, and by parameterizing the Pandas engine (similar to the `test_to_csv` tests, which use both `NumpyPandas` and `ArrowPandas`). This PR also makes the Python stubs for `{to,write}_{csv,parquet}` match, as they are technically aliases.
Addresses one of the points in #8896.
Adds the following parameters to write_csv:
The stubs are also updated to include these parameters.
Relevant docs: https://duckdb.org/docs/sql/statements/copy.html#copy--to-options