Add missing global options to Python's `write_parquet` #14766

fr3fou · 2024-11-09T12:11:24Z

This PR adds more options to DuckDB's Python Relational API for write_parquet, matching the COPY TO options, addressing #8896:

- `partition_by`
- `write_partition_columns`
- `overwrite`
- `per_thread_output`
- `use_tmp_file`
- `append`
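
To make the correspondence with `COPY ... TO` concrete, here is a minimal sketch that renders the equivalent statement as a string. The helper `build_copy_statement` is hypothetical and not part of DuckDB; only the option names come from the list above. Following the behavior described in this PR, `overwrite` is rendered as `OVERWRITE_OR_IGNORE`. Quoting and escaping are deliberately ignored.

```python
def build_copy_statement(table, path, *, partition_by=None,
                         write_partition_columns=None, overwrite=None,
                         per_thread_output=None, use_tmp_file=None,
                         append=None):
    """Hypothetical helper: render a COPY ... TO statement as text."""
    options = ["FORMAT PARQUET"]
    if partition_by:
        # partition_by is a list of column names
        options.append("PARTITION_BY (" + ", ".join(partition_by) + ")")
    # Boolean options; None means "not specified, use the engine default"
    flags = {
        "WRITE_PARTITION_COLUMNS": write_partition_columns,
        "OVERWRITE_OR_IGNORE": overwrite,  # assumption: mirrors to_csv
        "PER_THREAD_OUTPUT": per_thread_output,
        "USE_TMP_FILE": use_tmp_file,
        "APPEND": append,
    }
    for name, value in flags.items():
        if value is not None:
            options.append(f"{name} {'TRUE' if value else 'FALSE'}")
    return f"COPY {table} TO '{path}' ({', '.join(options)})"


statement = build_copy_statement(
    "orders", "out_dir", partition_by=["year", "month"], overwrite=True
)
print(statement)
```

Options left as `None` are omitted from the statement, so the engine's defaults apply.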

Note that the `overwrite` option added to the `to_csv` function (#10382) actually passes `overwrite_or_ignore` to the underlying engine:

duckdb/tools/pythonpkg/src/pyrelation.cpp

Lines 1291 to 1296 in fd5de06

    
    if (!py::none().is(overwrite)) {
        if (!py::isinstance<py::bool_>(overwrite)) {
            throw InvalidInputException("to_csv only accepts 'overwrite' as a boolean");
        }
        options["overwrite_or_ignore"] = {Value::BOOLEAN(py::bool_(overwrite))};
    }

To match this behavior, I've implemented `overwrite` for `write_parquet` the same way.
Changing it to pass `overwrite` and introducing `overwrite_or_ignore` as a separate option would be a breaking change, so I've avoided it.
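
For readers less familiar with pybind11, the mapping in the C++ snippet above can be paraphrased in plain Python. This is only an illustrative sketch: `collect_overwrite_option`, its `function_name` parameter, and the use of `ValueError` are assumptions for the example, not the actual implementation.

```python
def collect_overwrite_option(overwrite=None, function_name="to_parquet"):
    """Hypothetical mirror of the C++ check: validate and rename 'overwrite'."""
    options = {}
    if overwrite is not None:
        # Only real booleans are accepted, as in the py::bool_ check
        if not isinstance(overwrite, bool):
            raise ValueError(
                f"{function_name} only accepts 'overwrite' as a boolean"
            )
        # The user-facing name is 'overwrite', the engine sees 'overwrite_or_ignore'
        options["overwrite_or_ignore"] = overwrite
    return options


print(collect_overwrite_option(True))
print(collect_overwrite_option())
```

Keeping the user-facing name `overwrite` while forwarding `overwrite_or_ignore` preserves the existing `to_csv` behavior, which is the compatibility concern raised above.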

I've also extended the `test_to_parquet` tests with new cases for the flags above and parameterized them over the Pandas engine (using both NumpyPandas and ArrowPandas, as the `test_to_csv` tests do).

This PR also makes the Python stubs for `{to,write}_{csv,parquet}` match, as they are technically aliases.
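
One way to picture why aliased methods should expose identical signatures is the toy sketch below. `ToyRelation` and `signatures_match` are hypothetical stand-ins, not DuckDB classes or test utilities; the check simply compares the signatures Python reports for two attribute names.

```python
import inspect


class ToyRelation:
    """Hypothetical stand-in for a relation with an aliased writer method."""

    def write_parquet(self, file_name, *, partition_by=None,
                      overwrite=None, append=None):
        return f"parquet:{file_name}"

    # Alias: both names refer to the same function object
    to_parquet = write_parquet


def signatures_match(cls, first, second):
    """Return True when two attributes of cls report the same signature."""
    return (inspect.signature(getattr(cls, first))
            == inspect.signature(getattr(cls, second)))


print(signatures_match(ToyRelation, "write_parquet", "to_parquet"))
```

Hand-written stub files do not get this guarantee automatically, which is why the two declarations have to be kept in sync manually.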

Tishj

Thanks!

Add operator name to profiling output (duckdb/duckdb#14744) Add missing global options to Python's `write_parquet` (duckdb/duckdb#14766)

Add operator name to profiling output (duckdb/duckdb#14744) Add missing global options to Python's `write_parquet` (duckdb/duckdb#14766) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>

fr3fou added 7 commits November 9, 2024 11:52

feat: add partition_by & write_partition_columns to write_parquet

c0dabf1

chore: match stubs between {to,write}_{csv,parquet}

c0da3ef

feat: add tests

c0da101

feat: add overwrite, per_thread_output, use_tmp_file

c0da07a

feat: add append option

c0daa24

feat: test to_parquet using both numpy and arrow pandas

c0da17d

chore: format

c0daa19

duckdb-draftbot marked this pull request as draft November 9, 2024 12:16

fr3fou marked this pull request as ready for review November 9, 2024 12:16

Tishj approved these changes Nov 9, 2024


Mytherin merged commit 1aa2a7c into duckdb:main Nov 11, 2024
19 checks passed

fr3fou deleted the python-api-write-options branch November 11, 2024 09:04

github-actions bot mentioned this pull request Dec 21, 2024

vendor: Update vendored sources to duckdb/duckdb@b5fea5d7396f055753e50fdc0b321bf57e96219b duckdb/duckdb-r#670

Merged
