
Conversation

jzavala-gonzalez (Contributor)

Addresses one of the points in #8896.
Adds the following parameters to write_csv:

  • overwrite
  • per_thread_output
  • use_tmp_file
  • partition_by

The Python stubs are also updated to include these parameters.

Relevant docs: https://duckdb.org/docs/sql/statements/copy.html#copy--to-options

Mytherin (Collaborator) commented Feb 7, 2024

Thanks for the PR! Looks good to me. Could you just merge with master so the CI can re-run? @Tishj can you also have a look?

Tishj (Contributor) left a comment

Thanks for adding the options! Could you add some tests for them?
The only thing we really need to verify with these tests is that the options are passed correctly to the CSV writing method.

But that will probably mean we need to check the effects of each option on the resulting CSV.

There are already a couple of tests for the existing options in tools/pythonpkg/tests/fast/api/test_to_csv.py.

The hive_partitioning argument also needs to be added to read_csv so the Python API is fully compatible in both directions. Some clarification is also needed on how the overwrite argument is supposed to work.
github-actions bot marked this pull request as draft February 13, 2024 02:40
For overwrite, per_thread_output, use_tmp_file
jzavala-gonzalez marked this pull request as ready for review March 1, 2024 21:45
jzavala-gonzalez (Contributor, Author)

Added some tests. Note that for "partition_by", reopening the output is done through duckdb.sql, since read_csv does not yet implement the hive_partitioning argument. "overwrite" is tested both for working when enabled and for throwing when it isn't. The "per_thread_output" and "use_tmp_file" tests just check that the resulting file can be opened successfully.

github-actions bot marked this pull request as draft March 19, 2024 01:16
Mytherin marked this pull request as ready for review March 19, 2024 07:36
Tishj (Contributor) left a comment

Thanks, LGTM now 👍

Mytherin merged commit 72c0c60 into duckdb:main Apr 9, 2024
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Apr 9, 2024
Merge pull request duckdb/duckdb#10382 from jzavala-gonzalez/python-write-csv-options
jzavala-gonzalez (Contributor, Author)

Thank you!!

Mytherin added a commit that referenced this pull request Nov 11, 2024
This PR adds more options to DuckDB's Python Relational API for
`write_parquet`, matching the `COPY TO` options, addressing #8896:
- `partition_by`
- `write_partition_columns`
- `overwrite`
- `per_thread_output`
- `use_tmp_file`
- `append`

I would also like to note that the `overwrite` option added to the `to_csv` function (#10382) technically passes `overwrite_or_ignore` to the underlying engine:

https://github.com/duckdb/duckdb/blob/fd5de0607d7ab5bdddad62cc1a0225be72dff967/tools/pythonpkg/src/pyrelation.cpp#L1291-L1296

To match this behavior, I've implemented it the same way here. Changing it to pass `overwrite` and introducing `overwrite_or_ignore` as a separate option would be a [breaking change](https://duckdb.org/docs/sql/statements/copy.html#copy--to-options), so I've avoided doing that.

I've also improved the `test_to_parquet` tests by introducing new tests for the above-mentioned flags, as well as parameterizing the Pandas engine (similar to the `test_to_csv` tests, using both `NumpyPandas` and `ArrowPandas`).

This PR also makes the Python stubs for `{to,write}_{csv,parquet}` both
match, as they are technically aliases.