samansmink
Contributor

Fixes the following bug:

SELECT * from read_parquet("https://github.com/apache/arrow/raw/main/r/inst/v0.7.1.parquet");

Leading to

Invalid Input Error: Malformed parquet file: sum of total compressed bytes of columns seems incorrect

Why does this happen?

The reason is that in this Parquet file the index_page_offset fields are set to 0. DuckDB uses this value to compute the byte range that needs to be prefetched for each column chunk. With the offsets set to 0, DuckDB cannot easily determine the byte ranges to prefetch, so the prefetching mechanism no longer works properly.
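
To make the failure mode concrete, here is a minimal Python sketch of how a zero offset can break a byte-range computation. It is a simplified model, not DuckDB's code: `ChunkMeta`, `chunk_start`, and `offsets_consistent` are hypothetical names, and the rule that a chunk starts at its smallest recorded page offset is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkMeta:
    """Hypothetical stand-in for one column chunk's Parquet metadata."""
    data_page_offset: int
    total_compressed_size: int
    dictionary_page_offset: Optional[int] = None
    index_page_offset: Optional[int] = None


def chunk_start(meta: ChunkMeta) -> int:
    """Assumed rule: the chunk starts at the smallest offset that is set."""
    offsets = [meta.data_page_offset]
    for optional in (meta.dictionary_page_offset, meta.index_page_offset):
        if optional is not None:
            offsets.append(optional)
    return min(offsets)


def offsets_consistent(chunks: List[ChunkMeta]) -> bool:
    """The span covered by all chunks must be able to hold their bytes."""
    starts = [chunk_start(c) for c in chunks]
    ends = [s + c.total_compressed_size for s, c in zip(starts, chunks)]
    span = max(ends) - min(starts)
    return sum(c.total_compressed_size for c in chunks) <= span


# Two back-to-back chunks with sensible offsets.
good = [ChunkMeta(4, 100), ChunkMeta(104, 50)]
# The same chunks, but with index_page_offset written as 0.
bad = [ChunkMeta(4, 100, index_page_offset=0),
       ChunkMeta(104, 50, index_page_offset=0)]
```

With the well-formed offsets, the two chunks fit the span they cover. With index_page_offset set to 0, every chunk appears to start at byte 0, so the computed span is smaller than the bytes it is supposed to hold and the check fails.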

The fix

The workaround is to add an option that disables the prefetching mechanism and to hint to users that they can use that option. With the option set, prefetching is skipped for these files, allowing them to be scanned.

To be able to test this easily, I also added the debug option prefetch_all_parquet_files.

Raising an error with a hint, rather than silently disabling prefetching, forces users to become aware that this is happening and prevents quiet performance regressions.
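
The design choice of failing with a hint instead of degrading silently can be sketched as follows. This is only an illustration of the control flow in Python: `PrefetchError`, `plan_scan`, and the `disable_prefetching` flag are hypothetical names standing in for the option described above, whose actual name is not given here.

```python
class PrefetchError(ValueError):
    """Raised when page offsets make the prefetch range unreliable."""


def plan_scan(offsets_ok: bool, disable_prefetching: bool = False) -> str:
    """Decide how to scan a file (hypothetical, simplified control flow)."""
    if disable_prefetching:
        # The user opted out explicitly, so no offset check is needed.
        return "scan without prefetch"
    if not offsets_ok:
        # Fail loudly and name the workaround instead of slowing down silently.
        raise PrefetchError(
            "page offsets look incorrect; "
            "set disable_prefetching to scan this file"
        )
    return "scan with prefetch"
```

A caller that hits the error has to opt in to the slower path explicitly, which is what keeps the performance cost visible.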
Comment on lines 21 to 22
# This file messes with DuckDB's prefetching mechanism, however the prefetching mechanism automatically disables
# when it detects this situation meaning that the query still succeeds (possibly at a performance penalty)
Contributor

Outdated? Also the title of the PR I think

Contributor Author

indeed, thx!

@samansmink samansmink changed the title Auto-disable parquet prefetching on malformed metadata Provide workaround for prefetching parquet files with incorrect page offsets Sep 2, 2024
@duckdb-draftbot duckdb-draftbot marked this pull request as draft September 2, 2024 13:40
@Mytherin Mytherin marked this pull request as ready for review September 2, 2024 18:12
@duckdb-draftbot duckdb-draftbot marked this pull request as draft September 3, 2024 15:17
@samansmink samansmink marked this pull request as ready for review September 4, 2024 12:22
@samansmink
Contributor Author

I think this is good to go @Mytherin?

@Mytherin Mytherin changed the base branch from main to feature September 26, 2024 09:10
@Mytherin Mytherin merged commit 373b56f into duckdb:feature Sep 26, 2024
41 checks passed
@Mytherin
Collaborator

Thanks!

3 participants