Make `file_row_number` a virtual column, and support per-file virtual columns in the MultiFileReader #16979

Mytherin · 2025-04-03T16:41:16Z

Follow-up from #16248

This PR reworks `file_row_number` into a virtual column in the Parquet reader, so the following query now works:

SELECT l_orderkey, file_row_number FROM lineitem.parquet;

This PR also implements the necessary infrastructure for allowing arbitrary virtual columns to be defined by readers, so in the future adding new virtual columns to readers will be much simpler.

This rework allows for the removal of a bunch of hacky special-case code around the `file_row_number` column - this can now all live in the Parquet reader itself. Emitting the file row number is as simple as adding the special column identifier (`MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER`) to the set of projected column ids.
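To illustrate the idea of a reader handling a reserved column id itself, here is a minimal, self-contained sketch. It does not use DuckDB's actual classes: `ToyFile`, `Scan`, and the value of `kFileRowNumberId` are hypothetical stand-ins, and numbering rows from 0 is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a reserved virtual column identifier. The real value of
// MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER is not given in this PR text.
constexpr uint64_t kFileRowNumberId = UINT64_MAX - 1;

// A toy "file": physical int64 columns stored column-wise.
struct ToyFile {
    std::vector<std::vector<int64_t>> columns;

    std::size_t RowCount() const {
        return columns.empty() ? 0 : columns[0].size();
    }
};

// Produce one output column per projected id. Physical ids index into the file;
// the reserved id is generated by the reader instead of being read from storage.
std::vector<std::vector<int64_t>> Scan(const ToyFile &file, const std::vector<uint64_t> &projected_ids) {
    std::vector<std::vector<int64_t>> result;
    for (uint64_t id : projected_ids) {
        if (id == kFileRowNumberId) {
            // Virtual column: the position of each row within this file (0-based here).
            std::vector<int64_t> row_numbers(file.RowCount());
            for (std::size_t row = 0; row < row_numbers.size(); row++) {
                row_numbers[row] = static_cast<int64_t>(row);
            }
            result.push_back(row_numbers);
        } else {
            // Physical column: copy it out of the file (throws if the id is out of range).
            result.push_back(file.columns.at(id));
        }
    }
    return result;
}
```

In this sketch the caller requests the row number by adding `kFileRowNumberId` to the projection list, and no special case is needed outside `Scan`.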

Tishj · 2025-04-03T19:18:08Z

So in iceberg/delta we can set `MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER`, and if the user hasn't requested the column it will automatically not be emitted, because no Expression exists for it and the FinalizeChunk step removes it?

Mytherin · 2025-04-03T19:37:24Z

There are two ways to refer to the file_row_number now:

- Through pushing the (virtual) column_id `MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER`
- Through pushing the field identifier `ORDINAL_FIELD_ID` (which is a reserved field identifier as per the iceberg spec specifically for this column)

Which option is best depends on where this is used in the code.
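As a rough illustration of the two lookup paths described above, the following sketch resolves a column reference either by a reserved column id or by a reserved field id. Everything here is hypothetical: `ColumnRef`, `RefersToFileRowNumber`, and both constant values are placeholders chosen for the example, not DuckDB's definitions.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder values for illustration only; the actual constants live in DuckDB's
// MultiFileReader and are not quoted in this conversation.
constexpr uint64_t kVirtualFileRowNumberColumnId = UINT64_MAX - 1;
constexpr int32_t kOrdinalFieldId = 2147483645;

// A column reference is expressed either as a column id or as a field id.
enum class RefKind { ColumnId, FieldId };

struct ColumnRef {
    RefKind kind;
    uint64_t column_id; // meaningful when kind == RefKind::ColumnId
    int32_t field_id;   // meaningful when kind == RefKind::FieldId
};

ColumnRef ByColumnId(uint64_t column_id) {
    return ColumnRef{RefKind::ColumnId, column_id, 0};
}

ColumnRef ByFieldId(int32_t field_id) {
    return ColumnRef{RefKind::FieldId, 0, field_id};
}

// Both ways of referring to the row number resolve to the same virtual column.
bool RefersToFileRowNumber(const ColumnRef &ref) {
    switch (ref.kind) {
    case RefKind::ColumnId:
        return ref.column_id == kVirtualFileRowNumberColumnId;
    case RefKind::FieldId:
        return ref.field_id == kOrdinalFieldId;
    }
    return false;
}
```

Code that already works with projected column ids would take the first path, while code that maps columns through a schema with field ids (as iceberg does) would take the second.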

…(*)` directly in the multi file reader (#17325)

This PR generalizes the late materialization optimizer introduced in #15692 - allowing it to be used for the Parquet reader. In particular, the `TableFunction` is extended with an extra callback that allows specifying the relevant row-id columns:

```cpp
typedef vector<column_t> (*table_function_get_row_id_columns)(ClientContext &context, optional_ptr<FunctionData> bind_data);
```

This is then used by the Parquet reader to specify the two row-id columns: `file_index` (#17144) and `file_row_number` (#16979). Top-N, sample and limit/offset queries are then transformed into a join on the relevant row-id columns. For example:

```sql
SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;
-- becomes
SELECT * FROM lineitem.parquet WHERE (file_index, file_row_number) IN (
    SELECT file_index, file_row_number FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5)
ORDER BY l_extendedprice DESC;
```

### Performance

```sql
SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;
```

| v1.2.1 | main  | new   |
|--------|-------|-------|
| 0.19s  | 0.14s | 0.06s |

```sql
SELECT * FROM lineitem.parquet ORDER BY l_orderkey DESC LIMIT 5;
```

| v1.2.1 | main  | new   |
|--------|-------|-------|
| 0.73s  | 0.53s | 0.06s |

```sql
SELECT * FROM lineitem.parquet LIMIT 1000000 OFFSET 10000000;
```

| v1.2.1 | main  | new   |
|--------|-------|-------|
| 1.6s   | 1.2s  | 0.14s |

### Refactor

I've also moved the `ParquetMultiFileInfo` to a separate file as part of this PR - which is most of the changes here.
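The Top-N rewrite from #17325 described above can be pictured as a two-phase scan: first pick the winning `(file_index, file_row_number)` pairs while reading only the sort key, then fetch the full rows for just those pairs. The sketch below models this over in-memory vectors. It is not the optimizer's implementation, and `Row`, `RowId`, `TopNRowIds`, `FetchRows`, and `TopNLateMaterialized` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A toy row: one sort key plus a wide payload that is expensive to materialize.
struct Row {
    double price;
    std::string payload;
};

using File = std::vector<Row>;

// The pair of row-id columns that identifies a row across multiple files.
struct RowId {
    std::size_t file_index;
    std::size_t file_row_number;
};

using KeyAndId = std::pair<double, RowId>;

// Phase 1: read only the sort key and the row id, keep the `limit` largest prices.
std::vector<RowId> TopNRowIds(const std::vector<File> &files, std::size_t limit) {
    std::vector<KeyAndId> keys;
    for (std::size_t f = 0; f < files.size(); f++) {
        for (std::size_t r = 0; r < files[f].size(); r++) {
            keys.push_back(KeyAndId(files[f][r].price, RowId{f, r}));
        }
    }
    limit = std::min(limit, keys.size());
    std::partial_sort(keys.begin(), keys.begin() + static_cast<std::ptrdiff_t>(limit), keys.end(),
                      [](const KeyAndId &a, const KeyAndId &b) { return a.first > b.first; });
    std::vector<RowId> ids;
    for (std::size_t i = 0; i < limit; i++) {
        ids.push_back(keys[i].second);
    }
    return ids;
}

// Phase 2: materialize full rows only for the surviving row ids.
std::vector<Row> FetchRows(const std::vector<File> &files, const std::vector<RowId> &ids) {
    std::vector<Row> rows;
    for (const RowId &id : ids) {
        rows.push_back(files[id.file_index][id.file_row_number]);
    }
    return rows;
}

std::vector<Row> TopNLateMaterialized(const std::vector<File> &files, std::size_t limit) {
    return FetchRows(files, TopNRowIds(files, limit));
}
```

Because phase 1 already returns the ids in descending key order and phase 2 fetches in that order, this sketch does not need the final `ORDER BY` that the SQL rewrite applies after the join.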

Mytherin added 4 commits April 3, 2025 17:51

Add support for per-file virtual columns in the MultiFileReader, and use this to make file_row_number a virtual column (368e9bf)

Use the Iceberg reserved field id for file row number, fix file_row_number in conjunction with schema (c678c64)

Add test (31bb7a4)

Remove unused file_row_number idx (8241218)

duckdb-draftbot marked this pull request as draft April 3, 2025 16:45

Mytherin marked this pull request as ready for review April 3, 2025 18:32

Mytherin merged commit 3120809 into duckdb:main Apr 4, 2025
57 of 58 checks passed

Mytherin mentioned this pull request May 1, 2025

Support late materialization in the Parquet reader, and handle COUNT(*) directly in the multi file reader #17325

Merged

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025

vendor: Update vendored sources to duckdb/duckdb@3120809

0769f30

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025

vendor: Update vendored sources to duckdb/duckdb@3120809

e456cfb

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025

vendor: Update vendored sources to duckdb/duckdb@3120809

e31fb31

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025

vendor: Update vendored sources to duckdb/duckdb@3120809

5de81f4

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025

vendor: Update vendored sources to duckdb/duckdb@3120809

49a1574

Mytherin deleted the filerownumbervirtual branch June 12, 2025 15:29

Tishj mentioned this pull request Jun 23, 2025

Parquet Ambiguous Binder file_row_number #18026

Closed

2 tasks
