@Mytherin Mytherin commented Apr 3, 2025

Follow-up from #16248

This PR reworks `file_row_number` into a virtual column in the Parquet reader, so the following query now works:

SELECT l_orderkey, file_row_number FROM lineitem.parquet;

This PR also adds the infrastructure that lets readers define arbitrary virtual columns, so adding new virtual columns to readers will be much simpler in the future.

This rework allows a bunch of hacky special-case code around the `file_row_number` column to be removed, since that logic can now live entirely in the Parquet reader itself. Emitting the file row number is as simple as adding the special column id (`MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER`) to the set of projected column ids.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 3, 2025 16:45
@Mytherin Mytherin marked this pull request as ready for review April 3, 2025 18:32

Tishj commented Apr 3, 2025

So in iceberg/delta we can set `MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER`, and if the user hasn't requested the column, no Expression exists for it, so it is automatically removed by the FinalizeChunk step and never emitted?


Mytherin commented Apr 3, 2025

There are two ways to refer to the file_row_number now:

  • By pushing the (virtual) column id `MultiFileReader::COLUMN_IDENTIFIER_FILE_ROW_NUMBER`
  • By pushing the field identifier `ORDINAL_FIELD_ID` (a field identifier that the Iceberg spec reserves specifically for this column)

Which option is best depends on where this is used in the code.
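
A minimal sketch of how both kinds of reference could resolve to the same virtual column is shown here. It is an illustration under assumptions, not DuckDB code: `ColumnRef`, `RefersToFileRowNumber`, and both constants are hypothetical, and the constant values are placeholders rather than the real identifiers.

```cpp
#include <cstdint>
#include <limits>

// Placeholder values for illustration only; NOT taken from DuckDB or from
// the Iceberg spec.
constexpr uint64_t kFileRowNumberColumnId = std::numeric_limits<uint64_t>::max() - 1;
constexpr int32_t kOrdinalFieldId = std::numeric_limits<int32_t>::max() - 2;

// A column reference made either by column id or by field identifier.
struct ColumnRef {
    enum class Kind { COLUMN_ID, FIELD_ID };
    Kind kind;
    uint64_t value;
};

// Returns true when the reference points at the file row number column,
// regardless of which of the two identifier kinds was used.
bool RefersToFileRowNumber(const ColumnRef &ref) {
    switch (ref.kind) {
    case ColumnRef::Kind::COLUMN_ID:
        return ref.value == kFileRowNumberColumnId;
    case ColumnRef::Kind::FIELD_ID:
        return ref.value == static_cast<uint64_t>(kOrdinalFieldId);
    }
    return false;
}
```

In this sketch, code that works with projected column ids would build a `COLUMN_ID` reference, while code that maps columns by field identifier (as table formats such as Iceberg do) would build a `FIELD_ID` reference.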

@Mytherin Mytherin merged commit 3120809 into duckdb:main Apr 4, 2025
57 of 58 checks passed
Mytherin added a commit that referenced this pull request May 2, 2025
…(*)` directly in the multi file reader (#17325)

This PR generalizes the late materialization optimizer introduced in
#15692 - allowing it to be used for
the Parquet reader.

In particular, the `TableFunction` is extended with an extra callback
that allows specifying the relevant row-id columns:

```cpp
typedef vector<column_t> (*table_function_get_row_id_columns)(ClientContext &context,
                                                              optional_ptr<FunctionData> bind_data);
```

The Parquet reader uses this callback to specify its two row-id
columns: `file_index` (#17144) and
`file_row_number` (#16979). Top-N,
sample and limit/offset queries are then transformed into a join on the
relevant row-id columns. For example:


```sql
SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;

-- becomes

SELECT * FROM lineitem.parquet WHERE (file_index, file_row_number) IN (
    SELECT file_index, file_row_number FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5)
ORDER BY l_extendedprice DESC;
```
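
The rewrite above can be sketched in two phases: first rank using only the sort key and the row id, then fetch the wide rows for the winners. The code below is a toy in-memory model under stated assumptions, not the optimizer itself. `Row`, `Files`, `Candidate`, and `TopNLateMaterialized` are hypothetical names, and the `(file_index, file_row_number)` pair is modeled as a pair of vector indices.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy row: one sort key plus a payload standing in for all other columns.
struct Row {
    double l_extendedprice;
    std::string payload;
};

// files[file_index][file_row_number] is a row; the index pair is the row id.
using Files = std::vector<std::vector<Row>>;
using RowId = std::pair<size_t, size_t>;

struct Candidate {
    double key;
    RowId id;
};

std::vector<Row> TopNLateMaterialized(const Files &files, size_t n) {
    // Phase 1: scan only the sort key together with the row id.
    std::vector<Candidate> candidates;
    for (size_t f = 0; f < files.size(); f++) {
        for (size_t r = 0; r < files[f].size(); r++) {
            candidates.push_back(Candidate{files[f][r].l_extendedprice, RowId(f, r)});
        }
    }
    n = std::min(n, candidates.size());
    // Keep the n largest keys in descending order (ORDER BY ... DESC LIMIT n).
    std::partial_sort(candidates.begin(), candidates.begin() + n, candidates.end(),
                      [](const Candidate &a, const Candidate &b) { return a.key > b.key; });

    // Phase 2: materialize the full rows for the n winning row ids only.
    std::vector<Row> result;
    for (size_t i = 0; i < n; i++) {
        const RowId &id = candidates[i].id;
        result.push_back(files[id.first][id.second]);
    }
    return result;
}
```

The potential saving in this model comes from phase 1 touching only one narrow column, so the wide payload is read for at most `n` rows.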

### Performance

```sql
SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;
```

| v1.2.1 | main  | new   |
|--------|-------|-------|
| 0.19s  | 0.14s | 0.06s |


```sql
SELECT * FROM lineitem.parquet ORDER BY l_orderkey DESC LIMIT 5;
```
| v1.2.1 | main  |  new  |
|--------|-------|-------|
| 0.73s  | 0.53s | 0.06s |

```sql
SELECT * FROM lineitem.parquet LIMIT 1000000 OFFSET 10000000;
```
| v1.2.1 | main  | new   |
|--------|-------|-------|
| 1.6s   | 1.2s  | 0.14s |


### Refactor

I've also moved `ParquetMultiFileInfo` to a separate file as part of
this PR, which accounts for most of the changes here.
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
Make `file_row_number` a virtual column, and support per-file virtual columns in the MultiFileReader (duckdb/duckdb#16979)
@Mytherin Mytherin deleted the filerownumbervirtual branch June 12, 2025 15:29