Unnecessary bytes written to parquet file #17926

@dancory-urbanfootprint

Description

What happens?

DuckDB's Parquet writer emits a few unnecessary bytes when generating a particular file, which caused a read failure in the hyparquet library. The output appears to be legal under the Parquet specification, but the extra bytes serve no purpose.

Hyparquet ticket: hyparam/hyparquet#90
Hyparquet fix: hyparam/hyparquet#91

Comments from Hyparquet developer:
Handles the case where length is prepended in a data page v1 definition/repetition level block.

This length is generally redundant, so hyparquet was ignoring it. But in RARE cases, duckdb writes an empty bitpack block at the end of an RLE BitPacked Hybrid run. So if you don't respect this length, you're hosed for reading the data page after it.
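To illustrate why the length prefix matters, here is a minimal Python sketch of a reader for a data page v1 level block. It is not hyparquet's code; the function names `read_varint` and `read_levels_v1` are hypothetical, and the single trailing byte in the example stands in for whatever extra bytes the writer emitted. The point is that the reader resumes at `start + length`, not at wherever decoding happened to stop.

```python
import struct


def read_varint(buf, pos):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_levels_v1(buf, pos, bit_width, num_values):
    """Read a length-prefixed RLE/bit-packed hybrid block (hypothetical helper).

    Returns (levels, next_pos). next_pos is derived from the 4-byte
    little-endian length prefix, so trailing bytes inside the block are skipped.
    """
    (length,) = struct.unpack_from("<I", buf, pos)
    start = pos + 4
    end = start + length
    pos = start
    levels = []
    value_bytes = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    while len(levels) < num_values and pos < end:
        header, pos = read_varint(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values, packed LSB first.
            groups = header >> 1
            nbytes = groups * bit_width
            packed = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                levels.append((packed >> (i * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times.
            count = header >> 1
            value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            levels.extend([value] * count)
    # Honor the declared length even if decoding finished early.
    return levels[:num_values], end


# Illustrative block: an RLE run of eight 1s, then an empty bit-packed
# run header (0x01), followed by one byte (0xAA) of the next section.
block = struct.pack("<I", 3) + bytes([0x10, 0x01, 0x01]) + b"\xAA"
levels, next_pos = read_levels_v1(block, 0, bit_width=1, num_values=8)
```

In this example all eight values are decoded after the first run, two bytes into the block. A reader that ignored the prefix would resume one byte early and misread the empty run header as the start of the next section.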

I am 99% certain this is a bug in duckdb. At the end of a RleBpEncoder run, FinishWrite gets called, and in rare cases, this results in WriteCurrentBlockBP being called with bp_block_count = 0. In other words, it writes an empty BP block for no reason! 33 wasted bytes.
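The writer-side behavior described above can be sketched with a simplified, hypothetical encoder in Python. This is not DuckDB's `RleBpEncoder`; the function `write_bit_packed_run` and its `skip_empty` flag are assumptions made for illustration, and the sketch does not reproduce the exact byte layout or the 33-byte figure mentioned in the comment. It only shows how flushing a bit-packed run with zero groups still produces output unless the flush is guarded.

```python
def write_varint(out, value):
    """Append an unsigned LEB128 varint to a bytearray."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def write_bit_packed_run(out, values, bit_width, skip_empty=True):
    """Append one bit-packed run (hypothetical helper).

    len(values) must be a multiple of 8. With skip_empty=False, a run with
    zero groups still writes a header, which is the wasteful case.
    """
    assert len(values) % 8 == 0
    groups = len(values) // 8
    if groups == 0 and skip_empty:
        return  # nothing buffered, so nothing needs to be flushed
    write_varint(out, (groups << 1) | 1)  # low bit 1 marks a bit-packed run
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bit_width)  # pack values LSB first
    out.extend(packed.to_bytes(groups * bit_width, "little"))


unguarded = bytearray()
write_bit_packed_run(unguarded, [], bit_width=1, skip_empty=False)

guarded = bytearray()
write_bit_packed_run(guarded, [], bit_width=1, skip_empty=True)
```

Here `unguarded` ends up with a run header that describes zero values, while `guarded` stays empty. Both outputs decode to the same levels, so the extra bytes carry no information.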

To Reproduce

small23.parquet.zip

copy (select * from 'small23.parquet') to 'small23a.parquet' (format parquet, row_group_size 6144);

OS:

macOS

DuckDB Version:

1.3.0

DuckDB Client:

cli

Hardware:

No response

Full Name:

Dan Cory

Affiliation:

UrbanFootprint

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
