-
Notifications
You must be signed in to change notification settings - Fork 2.6k
[MultiFileReader] Rework MultiFileReader::FinalizeChunk
to use Expressions
#16630
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MultiFileReader] Rework MultiFileReader::FinalizeChunk
to use Expressions
#16630
Conversation
…d an output_chunk
…ile_reader_finalize_chunk
…mkay. If you have stale references you are bad, because stale references are bad. mkay. It's a bad thing to have stale references, so don't be bad by having stale references. mkay. Don't be bad. mkay
…gh, this could be shared by multiple threads, move it to the MultiFileLocalState instead
…ocal) col_idx -> global_idx, in the context of filters
…seen filter/stats issues if we keep these changes
…ants of the right type
…ile_reader_finalize_chunk
…ile_reader_finalize_chunk
@Mytherin we talked about the ExpressionExecutor construction/initialization being wasteful inside FinalizeChunk but I don't think it's possible to fix this, the ExpressionExecutor is not safe to share, because its state is mutable. Perhaps it can be moved into the |
It can't be shared indeed - it needs to be in |
That makes sense, so the construction of the ExpressionExecutor is what we're most concerned with, not necessarily the AddExpression calls? So we don't attach the expression executor to the lifetime of the file, we attach it to that of the thread (local) state? |
No the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Looks good - one more comment then this is good to go
… - this failed locally without it
This is the same issue as https://github.com/duckdblabs/duckdb-internal/issues/4499 |
…incorrectly changed by my PR or a previous PR of mine
Now its an http failure @Mytherin |
Thanks! |
#16838) This PR is a follow up to #16630 The aim of this PR is to not require the underlying reader to make use of complicated mappings created in the `MultiFileReaderData` to correctly handle filters pushed into the reader. Previously the `TableFilterSet` handed to the underlying reader contained "global" filters, i.e the `column_ids` referenced the "global" schema, which is not guaranteed to match the schema of the file that the reader is reading. To convert these into the "local" schema (that of the file that is being read), the `filter_map` had to be used, then the filter entry had to be checked to see if the filter points to a column that is present in the file being read, if it's not then the filter points to the entry in the `constant_map` that should be used instead. ---------------------------------- After this PR, the underlying reader is handed "local" filters, that already contain `column_ids` that reference the "local" schema. No filter will be found that references columns that are not present in the file, as they have been evaluated beforehand. These filters (the ones targeting columns that are not in the file) can be safely removed because if the filter doesn't match we skip the file entirely, it does not have to be scanned. If the filter *does* match, it can be safely ignored by the reader. ---------------------------------- ### Open questions / Areas for future work discovered #### Copied `child_indexes` issue The `child_indexes` of the global `ColumnIndex` are copied 1-to-1 in the creation of the mapping still: ```c++ //! FIXME: local fields are not guaranteed to match with the global fields for this struct local_index = ColumnIndex(local_id.GetId(), global_id.GetChildIndexes()); } ``` As the fixme says, this probably shouldn't happen, because schema evolution could create a divide between the global and local fields for a given struct. #### `DynamicFilter` copying DynamicFilter works similar to a BoundParameterExpression in that it contains a shared_ptr to a shared structure to allow the filter to be updated. This can't really be copied and still function correctly. Perhaps at this stage the DynamicFilter has served its purpose and can be safely transformed into a non-dynamic filter? I did not feel confident in making this decision. #### `StructFilter` schema evolution The StructFilter besides a `child_idx` (which suffers from the issue mentioned above) also has a `child_name`. Through schema evolution the field could be renamed, this information should probably be added to the `MultiFileReaderColumnDefinition` to properly handle renames of struct fields. Or that information is already present, through the LogicalType attached to the column definition, but it needs to be used properly. #### `cast_map` casts The `cast_map` is still present, and necessary to be respected by the underlying reader, because delaying this cast would complicate the filtering logic. If we look at the following flow: `MFR <-> READER <-> FILE` Taking the parquet scanner as an example; Currently, the `cast_map` is applied in the translation from the `FILE` to the `READER`, through a CastColumnReader. This reader isn't anything special, it just applies a DefaultCastAs to the vector read from the reader it wraps, this can easily be replaced by a BoundCastExpression that is applied in the `FinalizeChunk`. The problem with moving this cast up, is that it would be applied in the translation from `READER` to the `MFR`. This is a problem because the filter pushdown is done inside the READER (after the cast is performed), and so the filters are using (and expecting) the "global" type, the target of the cast. To move the cast up into the translation from `READER` to the `MFR` the filters that are sent to the reader would need to be changed to work on the "local" type instead, which is not an easy refactor.
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630) Revert "fix: drop useless python import" (duckdb/duckdb#16834)
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630) Revert "fix: drop useless python import" (duckdb/duckdb#16834)
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630) Revert "fix: drop useless python import" (duckdb/duckdb#16834)
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630) Revert "fix: drop useless python import" (duckdb/duckdb#16834)
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630) Revert "fix: drop useless python import" (duckdb/duckdb#16834)
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630) Revert "fix: drop useless python import" (duckdb/duckdb#16834)
Previously the underlying reader (such as csv or parquet) would read directly into a chunk that matches the "global" schema, containing columns that are perhaps constants for this file.
Because of this, the
column_mapping
had to be applied in the underlying reader to find the Vector that should be written into.This PR replaces that, letting the underlying reader be unaware of the global schema, moving this mapping logic up to the FinalizeChunk method of the MultiFileReader implementation.
I was hoping to remove the need to expose the
column_mapping
to the underlying reader entirely, along withconstant_map
,cast_map
andfilter_map
, but this is not possible without a bigger rework.EDIT:
A follow up PR is in the works to make the underlying reader unaware of the global schema when it comes to filters