Skip to content

Conversation

Mytherin
Copy link
Collaborator

This got lost during the column writer unification. This fixes a regression in the Parquet writer where, when writing large strings, the page size would be greatly underestimated. This is used upstream in the writer to limit page size to 100MB. Since the page size would be underestimated, we would be writing much larger pages than this when writing large strings, leading to potential memory issues. In addition, since Parquet pages are limited to 2GB (on account of page size being stored as an int32) - I think this could also lead to too large pages being written that could not be correctly represented in the Parquet spec in certain edge cases.

@Mytherin Mytherin changed the title Parquet writer: Re-implement GetRowSize Parquet writer: Re-implement GetRowSize for Strings Feb 11, 2025
@duckdb-draftbot duckdb-draftbot marked this pull request as draft February 11, 2025 11:29
@Mytherin Mytherin marked this pull request as ready for review February 11, 2025 11:30
@Mytherin Mytherin merged commit 989e653 into duckdb:v1.2-histrionicus Feb 11, 2025
50 checks passed
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Mar 7, 2025
Parquet writer: Re-implement GetRowSize for Strings (duckdb/duckdb#16178)
@Mytherin Mytherin deleted the getrowsize branch April 2, 2025 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant