Parquet writer: Re-implement GetRowSize for Strings #16178
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This got lost during the column writer unification. This fixes a regression in the Parquet writer where, when writing large strings, the page size would be greatly underestimated. This is used upstream in the writer to limit page size to 100MB. Since the page size would be underestimated, we would be writing much larger pages than this when writing large strings, leading to potential memory issues. In addition, since Parquet pages are limited to 2GB (on account of page size being stored as an
int32
) - I think this could also lead to too large pages being written that could not be correctly represented in the Parquet spec in certain edge cases.