Skip to content

Conversation

hannes
Copy link
Member

@hannes hannes commented Mar 18, 2024

This PR adds support for LZ4 compression to the Parquet reader and writer. As per the Parquet compression spec, we implement only the newer LZ4_RAW scheme and not the deprecated LZ4 scheme. However, we allow both variants to be used in COPY commands because the subtle difference is reasonably lost on users:

COPY ... (FORMAT 'parquet', CODEC 'LZ4');
COPY ... (FORMAT 'parquet', CODEC 'LZ4_RAW');

will mean the same, and LZ4_RAW will be used.

@github-actions github-actions bot marked this pull request as draft March 18, 2024 12:38
@hannes hannes marked this pull request as ready for review March 18, 2024 12:38
statement error
SELECT * FROM parquet_scan('data/parquet-testing/compression/generated/data_page=2_LZ4.parquet') limit 50
query IIII
SELECT * FROM parquet_scan('data/parquet-testing/compression/generated/data_page=2_ZSTD.parquet', hive_partitioning=0) limit 50
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but @samansmink this seems like a bug in the hive partitioning. Equality in file names shouldn't be auto-detected (or even interpreted) as hive partitions, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I don't even think the filename should ever be able to contain an encoded partition, let alone be autodetected as such.

Copy link
Contributor

@Tishj Tishj Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine this stems from #7344 then?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems to have added the work-around for these files, but file names shouldn't be considered hive partitions in any case also not when explicitly enabled

@github-actions github-actions bot marked this pull request as draft March 18, 2024 14:12
@hannes hannes marked this pull request as ready for review March 18, 2024 14:12
@Mytherin
Copy link
Collaborator

Thanks for the PR! Looks good to me - just seems like the LZ4 symbols need to be namespaced to fix the symbol leakage test

@github-actions github-actions bot marked this pull request as draft March 19, 2024 13:40
@Mytherin Mytherin marked this pull request as ready for review March 19, 2024 13:52
@github-actions github-actions bot marked this pull request as draft March 19, 2024 14:35
@hannes hannes marked this pull request as ready for review March 19, 2024 14:35
@hannes hannes merged commit 1693804 into duckdb:main Mar 20, 2024
@szarnyasg szarnyasg added the Needs Documentation Use for issues or PRs that require changes in the documentation label Mar 20, 2024
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 20, 2024
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Mar 23, 2024
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Documentation Use for issues or PRs that require changes in the documentation Ready To Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants