Skip to content

Conversation

Maxxen
Copy link
Member

@Maxxen Maxxen commented Sep 16, 2024

When writing GeoParquet, we stored the column-level auxiliary geoparquet statistics/metadata directly in the GeometryColumnWriter, which is not thread safe, causing issues when multiple row-groups are written concurrently.

This PR moves the column-level geoparquet metadata from the ColumnWriter into the associated ColumnWriterState, and also introduces a write-lock on the file-level geoparquet metadata struct so that the threads don't race each other when flushing the stats after finishing a row group.

Additionally I've changed the GeometryColumnWriter into a WKBColumnWriter that directly subclasses the StringColumnWriter instead of trying to template the base class as that is significantly messier to do when we also need to wrap the associated state. This makes the code a bit cleaner, and makes sense given that the WKB geometry encoding is the only one we support right now anyway. I've also realized that the implementation for the other encodings is going to end up different enough that there is probably no point in trying to generalize all the geoparquet specific encoding logic into a common base class anyway.

Closes: #13914

@Mytherin Mytherin merged commit b8c5fa9 into duckdb:main Sep 16, 2024
37 of 38 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Sep 26, 2024
Fix data race when writing GeoParquet (duckdb/duckdb#13962)
Modify create_art_varchar.benchmark so it passes weekly regressions (duckdb/duckdb#13968)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Sep 26, 2024
Fix data race when writing GeoParquet (duckdb/duckdb#13962)
Modify create_art_varchar.benchmark so it passes weekly regressions (duckdb/duckdb#13968)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

malloc heap corruption when reading and writing to parquet
2 participants