Provide workaround for prefetching parquet files with incorrect page offsets #13697

samansmink · 2024-09-02T12:39:26Z

Fixes the following bug:

SELECT * from read_parquet("https://github.com/apache/arrow/raw/main/r/inst/v0.7.1.parquet");

Leading to

Invalid Input Error: Malformed parquet file: sum of total compressed bytes of columns seems incorrect

Why does this happen?

In this Parquet file, the index_page_offset fields are set to 0. DuckDB uses this value to compute the byte range that needs to be prefetched for each column chunk. With an offset of 0, DuckDB cannot easily determine the correct byte ranges, so the prefetching mechanism no longer works properly.
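To illustrate why a zero offset is a problem, here is a minimal Python sketch of a byte-range computation. It is a simplified model, not DuckDB's actual code: the `ColumnChunkMeta` class and `prefetch_range` function are hypothetical names, and the rule "start at the smallest reported offset" is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ColumnChunkMeta:
    """Hypothetical, simplified stand-in for Parquet column chunk metadata."""
    data_page_offset: int
    total_compressed_size: int
    dictionary_page_offset: Optional[int] = None
    index_page_offset: Optional[int] = None


def prefetch_range(chunk: ColumnChunkMeta) -> Tuple[int, int]:
    """Return (start, end) of the bytes to prefetch for one column chunk.

    Assumed model: the chunk starts at the smallest offset the metadata
    reports and spans total_compressed_size bytes from there.
    """
    offsets = [chunk.data_page_offset]
    if chunk.dictionary_page_offset is not None:
        offsets.append(chunk.dictionary_page_offset)
    if chunk.index_page_offset is not None:
        offsets.append(chunk.index_page_offset)
    start = min(offsets)
    return start, start + chunk.total_compressed_size


# Well-formed metadata: the range covers the chunk's actual bytes.
good = ColumnChunkMeta(data_page_offset=1024, total_compressed_size=500)
# Malformed metadata: index_page_offset is present but set to 0.
bad = ColumnChunkMeta(data_page_offset=1024, total_compressed_size=500,
                      index_page_offset=0)

good_range = prefetch_range(good)
bad_range = prefetch_range(bad)
print(good_range)  # (1024, 1524)
print(bad_range)   # (0, 500)
```

In this model the zero offset drags the start of the range to the beginning of the file, so the computed range ends before the data page even begins.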

The fix

The workaround adds an option that disables the prefetching mechanism, and hints users to use that option. With prefetching disabled, these files can be scanned.

To make this easy to test, I also added the debug option prefetch_all_parquet_files.
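The policy described here (fail by default, scan only after an explicit opt-out) could be sketched roughly as follows. This is an illustration in Python, not the C++ code in this PR: `should_prefetch`, `MalformedParquetError`, the parameter names, and the error text are all hypothetical.

```python
class MalformedParquetError(Exception):
    """Hypothetical error type for metadata that cannot be prefetched safely."""


def should_prefetch(offsets_look_valid: bool,
                    prefetching_disabled: bool = False) -> bool:
    """Decide whether to prefetch a column chunk.

    Fails loudly on suspicious metadata unless the user opted out of
    prefetching, so a slower scan is always an explicit choice.
    """
    if prefetching_disabled:
        # The user accepted the slower path; the offsets are not needed.
        return False
    if not offsets_look_valid:
        raise MalformedParquetError(
            "page offsets look incorrect; disable prefetching to scan this file"
        )
    return True


print(should_prefetch(offsets_look_valid=True))  # True
print(should_prefetch(offsets_look_valid=False, prefetching_disabled=True))  # False
try:
    should_prefetch(offsets_look_valid=False)
except MalformedParquetError as err:
    print(f"error: {err}")
```

Raising by default rather than silently falling back means the user sees the problem once and chooses the slower scan deliberately.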


carlopi · 2024-09-02T13:29:47Z

test/parquet/prefetching.test

+# This file messes with DuckDB's prefetching mechanism, however the prefetching mechanism automatically disables
+# when it detects this situation meaning that the query still succeeds (possibly at a performance penalty) 


Outdated? Also the title of the PR I think

indeed, thx!

samansmink · 2024-09-26T09:08:33Z

I think this is good to go @Mytherin?

Mytherin · 2024-09-26T09:12:16Z

Thanks!

samansmink added 3 commits September 2, 2024 13:58

allow incorrect index page offsets by disabling the prefetching mecha…

8f2e835

…nism when detecting them

add test for prefetching mechanism

282bddb

switch to failing by default, but allowing the workaround

a62507c

This forces users to become aware that this is happening and preventing quiet performance regressions

carlopi reviewed Sep 2, 2024


samansmink changed the title ~~Auto-disable parquet prefetching on malformed metadata~~ Provide workaround for prefetching parquet files with incorrect page offsets Sep 2, 2024

remove outdated comment

274a84d

duckdb-draftbot marked this pull request as draft September 2, 2024 13:40

Mytherin marked this pull request as ready for review September 2, 2024 18:12

add missing settings to autoloader

236470a

duckdb-draftbot marked this pull request as draft September 3, 2024 15:17

use correct order for settings

d445dba

samansmink marked this pull request as ready for review September 4, 2024 12:22

Mytherin changed the base branch from main to feature September 26, 2024 09:10

Mytherin merged commit 373b56f into duckdb:feature Sep 26, 2024
41 checks passed
