
Conversation

@Mytherin (Collaborator) commented May 1, 2025

This PR generalizes the late materialization optimizer introduced in #15692 - allowing it to be used for the Parquet reader.

In particular, the TableFunction is extended with an extra callback that allows specifying the relevant row-id columns:

typedef vector<column_t> (*table_function_get_row_id_columns)(ClientContext &context,
                                                              optional_ptr<FunctionData> bind_data);
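The shape of this mechanism can be sketched as follows. This is a hedged Python illustration of the idea (an optional callback that a table function may expose, which the optimizer checks before rewriting), not DuckDB code; all names here (`TableFunction`, `parquet_row_id_columns`, `bind_data` keys) are hypothetical.

```python
# Hypothetical sketch: a table function optionally exposes a
# get_row_id_columns callback; the late materialization optimizer
# only fires when the callback is present.

class TableFunction:
    def __init__(self, name, columns, get_row_id_columns=None):
        self.name = name
        self.columns = columns                        # ordered column names
        self.get_row_id_columns = get_row_id_columns  # optional callback

def parquet_row_id_columns(context, bind_data):
    # The Parquet reader would report its two row-id columns by index
    # (file_index and file_row_number in the PR).
    return [bind_data["file_index_col"], bind_data["file_row_number_col"]]

parquet_scan = TableFunction(
    "parquet_scan",
    ["l_orderkey", "l_extendedprice", "file_index", "file_row_number"],
    get_row_id_columns=parquet_row_id_columns,
)

def supports_late_materialization(fn):
    # The optimizer can only rewrite scans that know their row-id columns.
    return fn.get_row_id_columns is not None

bind_data = {"file_index_col": 2, "file_row_number_col": 3}
assert supports_late_materialization(parquet_scan)
assert parquet_scan.get_row_id_columns(None, bind_data) == [2, 3]
```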

This is then used by the Parquet reader to specify the two row-id columns: file_index (#17144) and file_row_number (#16979). Top-N, sample and limit/offset queries are then transformed into a join on the relevant row-id columns. For example:

SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;

-- becomes

SELECT * FROM lineitem.parquet WHERE (file_index, file_row_number) IN (
    SELECT file_index, file_row_number FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5)
ORDER BY l_extendedprice DESC;
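The win comes from sorting only a narrow projection (the ordering column plus the row ids) and materializing the remaining columns for just the few surviving rows. A small Python simulation of that idea (toy data and hypothetical function names, not DuckDB code):

```python
# Simulate late materialization for a Top-N query: phase 1 sorts a
# narrow (price, row-id) projection and keeps the top n; phase 2
# fetches the wide rows only for those n row ids (the "join").

rows = [
    # (file_index, file_row_number, l_extendedprice, payload)
    (0, 0, 100.0, "a"), (0, 1, 500.0, "b"),
    (1, 0, 300.0, "c"), (1, 1, 900.0, "d"), (1, 2, 200.0, "e"),
]

def topn_late_materialized(rows, n):
    # Phase 1: scan only the ordering column and the row ids.
    narrow = [(price, (fi, frn)) for fi, frn, price, *_ in rows]
    top_ids = [rid for _, rid in sorted(narrow, reverse=True)[:n]]
    # Phase 2: fetch the full rows for just those row ids.
    by_id = {(r[0], r[1]): r for r in rows}
    return [by_id[rid] for rid in top_ids]

print(topn_late_materialized(rows, 2))
# Top-2 by l_extendedprice: the rows with prices 900.0 and 500.0.
```

In a real scan the phase-2 lookup is a second read of the Parquet file restricted to the matching (file_index, file_row_number) pairs, which is why the rewritten query above re-scans lineitem.parquet.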

Performance

SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;
    v1.2.1: 0.19s    main: 0.14s    new: 0.06s

SELECT * FROM lineitem.parquet ORDER BY l_orderkey DESC LIMIT 5;
    v1.2.1: 0.73s    main: 0.53s    new: 0.06s

SELECT * FROM lineitem.parquet LIMIT 1000000 OFFSET 10000000;
    v1.2.1: 1.6s     main: 1.2s     new: 0.14s

Refactor

I've also moved ParquetMultiFileInfo to a separate file as part of this PR - which accounts for most of the changes here.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 1, 2025 17:23
@Mytherin Mytherin marked this pull request as ready for review May 1, 2025 17:25
@Mytherin Mytherin merged commit ced12aa into duckdb:main May 2, 2025
49 checks passed
@djouallah

@Mytherin do you have any plan to extend this to filtering? For example:

from './orders/*.parquet' where o_orderkey = 1

@Mytherin (Collaborator, Author) commented May 3, 2025

Filtering would benefit far less from this - most of the benefits are already gained from pushing filters into the scan (which we already do).

It is also much harder to do for filters, since we need to decide whether or not to use late materialization during planning - and it is only beneficial when the result set is very small. That means we would need to accurately predict, at optimization time, that the result of a filter is small, which is difficult.

Perhaps something we could do for filtering is dynamic prefetching, where we choose whether or not to prefetch other columns based on how selective the filters are - if the filters are very selective, we stop pre-emptively prefetching columns. But that is not directly related to this PR, of course.
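The dynamic-prefetch idea could look roughly like this. This is a hedged sketch only (the class, threshold, and counters are made up for illustration; no such feature exists in the PR):

```python
# Sketch: decide per batch whether to keep prefetching payload columns,
# based on the filter selectivity observed so far. The 1% threshold is
# an arbitrary illustrative choice.

class PrefetchPolicy:
    def __init__(self, threshold=0.01):
        self.threshold = threshold
        self.rows_scanned = 0
        self.rows_passed = 0

    def observe(self, scanned, passed):
        # Record how many rows were scanned and how many survived the filter.
        self.rows_scanned += scanned
        self.rows_passed += passed

    def should_prefetch_payload(self):
        # Before any data is seen, prefetch optimistically.
        if self.rows_scanned == 0:
            return True
        selectivity = self.rows_passed / self.rows_scanned
        # A highly selective filter means few rows survive, so fetching
        # payload columns lazily (late materialization) wins: stop
        # pre-emptively prefetching them.
        return selectivity > self.threshold

policy = PrefetchPolicy(threshold=0.01)
policy.observe(scanned=100_000, passed=5)        # filter keeps 0.005%
assert policy.should_prefetch_payload() is False

policy2 = PrefetchPolicy(threshold=0.01)
policy2.observe(scanned=100_000, passed=50_000)  # filter keeps 50%
assert policy2.should_prefetch_payload() is True
```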

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
Support late materialization in the Parquet reader, and handle `COUNT(*)` directly in the multi file reader (duckdb/duckdb#17325)
@Mytherin Mytherin deleted the multifilereaderrowid branch June 12, 2025 15:28