Description
What happens?
The Parquet writer emits a few unnecessary bytes when generating a particular file, which caused a problem for the hyparquet library. The output appears to be legal under the Parquet specification, but the extra bytes are not necessary.
Hyparquet ticket: hyparam/hyparquet#90
Hyparquet fix: hyparam/hyparquet#91
Comments from Hyparquet developer:
Handles the case where length is prepended in a data page v1 definition/repetition level block.
This length is generally redundant, so hyparquet was ignoring it. But in RARE cases, duckdb writes an empty bitpack block at the end of an RLE BitPacked Hybrid run. So if you don't respect this length, you're hosed for reading the data page after it.
I am 99% certain this is a bug in duckdb. At the end of a RleBpEncoder run, FinishWrite gets called, and in rare cases, this results in WriteCurrentBlockBP being called with bp_block_count = 0. In other words, it writes an empty BP block for no reason! 33 wasted bytes.
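To illustrate why a reader has to respect the prepended length, here is a minimal Python sketch of reading a data page v1 level block. It is not hyparquet's or DuckDB's code: the function names (`read_varint`, `decode_hybrid`, `read_levels_v1`) and the sample bytes are hypothetical, and the trailing empty bit-packed run is reduced to a single header byte rather than the exact bytes DuckDB emits.

```python
import struct


def read_varint(buf, pos):
    """Read an unsigned LEB128 varint; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_hybrid(buf, pos, bit_width, num_values):
    """Decode RLE/bit-packed hybrid runs until num_values are collected.

    Returns (values, position just past the last run that was consumed).
    """
    values = []
    byte_width = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    while len(values) < num_values:
        header, pos = read_varint(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values each.
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(buf[pos:pos + nbytes], "little")
            for i in range(groups * 8):
                values.append((bits >> (i * bit_width)) & mask)
            pos += nbytes
        else:
            # RLE run: one value repeated (header >> 1) times.
            count = header >> 1
            value = int.from_bytes(buf[pos:pos + byte_width], "little")
            values.extend([value] * count)
            pos += byte_width
    return values[:num_values], pos


def read_levels_v1(page, pos, bit_width, num_values):
    """Read a length-prefixed level block from a data page v1.

    Returns (values, next_pos, consumed_pos). next_pos honours the 4-byte
    little-endian length prefix; consumed_pos is where the decoder stopped.
    """
    (length,) = struct.unpack_from("<I", page, pos)
    start = pos + 4
    values, consumed_pos = decode_hybrid(page, start, bit_width, num_values)
    return values, start + length, consumed_pos


# Hypothetical page: one RLE run (five 1s), then an empty bit-packed run
# header (0 groups), then the bytes of whatever section follows.
levels = b"\x0a\x01" + b"\x01"
page = struct.pack("<I", len(levels)) + levels + b"DATA"

values, next_pos, consumed_pos = read_levels_v1(page, 0, bit_width=1, num_values=5)
print(values, next_pos, consumed_pos)
print(page[next_pos:], page[consumed_pos:])
```

In this sketch the decoder has all five values after the first run, so `consumed_pos` stops one byte short of `next_pos`. A reader that continues from `consumed_pos` would treat the leftover run header as the start of the next section, while a reader that jumps to `next_pos` lands on the correct byte.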
To Reproduce
copy (select * from 'small23.parquet') to 'small23a.parquet' (format parquet, row_group_size 6144);
OS:
macOS
DuckDB Version:
1.3.0
DuckDB Client:
cli
Hardware:
No response
Full Name:
Dan Cory
Affiliation:
UrbanFootprint
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have