[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions #16630

Tishj · 2025-03-13T11:23:35Z

Previously the underlying reader (such as csv or parquet) would read directly into a chunk that matches the "global" schema, containing columns that are perhaps constants for this file.
Because of this, the column_mapping had to be applied in the underlying reader to find the Vector that should be written into.

This PR replaces that, letting the underlying reader be unaware of the global schema, moving this mapping logic up to the FinalizeChunk method of the MultiFileReader implementation.

I was hoping to remove the need to expose the column_mapping to the underlying reader entirely, along with constant_map, cast_map and filter_map, but this is not possible without a bigger rework.

EDIT:
A follow up PR is in the works to make the underlying reader unaware of the global schema when it comes to filters

…d an output_chunk

…inalize_chunk

…ile_reader_finalize_chunk

…mkay. If you have stale references you are bad, because stale references are bad. mkay. It's a bad thing to have stale references, so don't be bad by having stale references. mkay. Don't be bad. mkay

…gh, this could be shared by multiple threads, move it to the MultiFileLocalState instead

…ocal) col_idx -> global_idx, in the context of filters

…seen filter/stats issues if we keep these changes

…ants of the right type

…ile_reader_finalize_chunk

…inalize_chunk

Tishj · 2025-03-13T11:27:15Z

@Mytherin we talked about the ExpressionExecutor construction/initialization being wasteful inside FinalizeChunk but I don't think it's possible to fix this, the ExpressionExecutor is not safe to share, because its state is mutable.
There could be multiple threads executing FinalizeChunk on chunks that originates from the same source file at the same time, so this can't be placed in the MultiFileReaderData.

Perhaps it can be moved into the MultiFileScan body, so the FinalizeChunk method requires a ExpressionExecutor & to be passed in?

Mytherin · 2025-03-13T11:29:58Z

It can't be shared indeed - it needs to be in MultiFileLocalState as part of the thread-local table function state. It can be initialized in TryInitializeNextBatch and passed into FinalizeChunk.

Tishj · 2025-03-13T12:05:23Z

It can't be shared indeed - it needs to be in MultiFileLocalState as part of the thread-local table function state. It can be initialized in TryInitializeNextBatch and passed into FinalizeChunk.

That makes sense, so the construction of the ExpressionExecutor is what we're most concerned with, not necessarily the AddExpression calls?
(while also limiting the add expression calls to only be done once per file handled by the thread)

So we don't attach the expression executor to the lifetime of the file, we attach it to that of the thread (local) state?

Mytherin · 2025-03-13T12:59:32Z

No the AddExpression calls are the most expensive since that sets up all of the expression states/buffers - but once per file is fine.

…inalize_chunk

Mytherin

Thanks for the PR! Looks good - one more comment then this is good to go

src/include/duckdb/common/multi_file_reader_function.hpp

…inalize_chunk

… - this failed locally without it

…inalize_chunk

Tishj · 2025-03-25T17:23:45Z

Run ./build/release/duckdb -c "PRAGMA platform;"
┌──────────────────┐
│     platform     │
│     varchar      │
├──────────────────┤
│ linux_amd64_gcc4 │
└──────────────────┘

2025-03-25T16:13:49.1845660Z [1/1] (100%): test/sql/copy/csv/afl/fuzz_20250211_crash.test                    
2025-03-25T16:13:49.1846009Z ===============================================================================
2025-03-25T16:13:49.1846273Z test cases: 1 | 1 failed
2025-03-25T16:13:49.1846492Z assertions: 2 | 1 passed | 1 failed
2025-03-25T16:13:49.1846665Z 
2025-03-25T16:13:49.1846669Z 
2025-03-25T16:13:49.1846739Z --------------------
2025-03-25T16:13:49.1847131Z STDERR
2025-03-25T16:13:49.1847304Z --------------------
2025-03-25T16:13:49.1847514Z ================================================================================
2025-03-25T16:13:49.1847927Z Query failed with internal exception! (test/sql/copy/csv/afl/fuzz_20250211_crash.test:8)!
2025-03-25T16:13:49.1848343Z ================================================================================
2025-03-25T16:13:49.1848726Z FROM read_csv('data/csv/afl/20250211_csv_fuzz_crash/case_53.csv', buffer_size=42);
2025-03-25T16:13:49.1849080Z Actual result:
2025-03-25T16:13:49.1849277Z ================================================================================
2025-03-25T16:13:49.1849623Z INTERNAL Error: Attempted to access index 3 within vector of size 3
2025-03-25T16:13:49.1849876Z 
2025-03-25T16:13:49.1849954Z Stack Trace:
2025-03-25T16:13:49.1850061Z 
2025-03-25T16:13:49.1850582Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::Exception::ToJSON(duckdb::ExceptionType, std::string const&)+0x53) [0x7efd101e0883]
2025-03-25T16:13:49.1851645Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::Exception::Exception(duckdb::ExceptionType, std::string const&)+0x16) [0x7efd101e08b6]
2025-03-25T16:13:49.1852888Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::InternalException::InternalException(std::string const&)+0x11) [0x7efd101e2f01]
2025-03-25T16:13:49.1853959Z build/release/test/unittest(duckdb::InternalException::InternalException<unsigned long, unsigned long>(std::string const&, unsigned long, unsigned long)+0x137) [0x67da17]
2025-03-25T16:13:49.1854794Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(+0x975282) [0x7efd0f775282]
2025-03-25T16:13:49.1855409Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(+0x15ceefe) [0x7efd103ceefe]
2025-03-25T16:13:49.1856203Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::StringValueScanner::Flush(duckdb::DataChunk&)+0xb1) [0x7efd103cf191]
2025-03-25T16:13:49.1859526Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::MultiFileReaderFunction<duckdb::CSVMultiFileInfo>::MultiFileScan(duckdb::ClientContext&, duckdb::TableFunctionInput&, duckdb::DataChunk&)+0xae) [0x7efd0fe86b8e]
2025-03-25T16:13:49.1862116Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::PhysicalTableScan::GetData(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSourceInput&) const+0x52) [0x7efd104aa7c2]
2025-03-25T16:13:49.1864253Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::PipelineExecutor::FetchFromSource(duckdb::DataChunk&)+0x78) [0x7efd1063df48]
2025-03-25T16:13:49.1865457Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::PipelineExecutor::Execute(unsigned long)+0x12b) [0x7efd1064929b]
2025-03-25T16:13:49.1866643Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode)+0xe4) [0x7efd10649524]
2025-03-25T16:13:49.1867899Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::ExecutorTask::Execute(duckdb::TaskExecutionMode)+0xce) [0x7efd1063fb9e]
2025-03-25T16:13:49.1868911Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::TaskScheduler::ExecuteForever(std::atomic<bool>*)+0x127) [0x7efd10647fb7]
2025-03-25T16:13:49.1869711Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(+0x2455310) [0x7efd11255310]
2025-03-25T16:13:49.1870201Z /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7efd0e69caa4]
2025-03-25T16:13:49.1870550Z /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7efd0e729c3c]
2025-03-25T16:13:49.1870762Z 
2025-03-25T16:13:49.1871141Z This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
2025-03-25T16:13:49.1871788Z For more information, see https://duckdb.org/docs/dev/internal_errors

This is the same issue as https://github.com/duckdblabs/duckdb-internal/issues/4499

…incorrectly changed by my PR or a previous PR of mine

Tishj · 2025-03-26T06:34:13Z

Now its an http failure @Mytherin

Mytherin · 2025-03-26T08:15:02Z

Thanks!

#16838) This PR is a follow up to #16630 The aim of this PR is to not require the underlying reader to make use of complicated mappings created in the `MultiFileReaderData` to correctly handle filters pushed into the reader. Previously the `TableFilterSet` handed to the underlying reader contained "global" filters, i.e the `column_ids` referenced the "global" schema, which is not guaranteed to match the schema of the file that the reader is reading. To convert these into the "local" schema (that of the file that is being read), the `filter_map` had to be used, then the filter entry had to be checked to see if the filter points to a column that is present in the file being read, if it's not then the filter points to the entry in the `constant_map` that should be used instead. ---------------------------------- After this PR, the underlying reader is handed "local" filters, that already contain `column_ids` that reference the "local" schema. No filter will be found that references columns that are not present in the file, as they have been evaluated beforehand. These filters (the ones targeting columns that are not in the file) can be safely removed because if the filter doesn't match we skip the file entirely, it does not have to be scanned. If the filter *does* match, it can be safely ignored by the reader. ---------------------------------- ### Open questions / Areas for future work discovered #### Copied `child_indexes` issue The `child_indexes` of the global `ColumnIndex` are copied 1-to-1 in the creation of the mapping still: ```c++ //! FIXME: local fields are not guaranteed to match with the global fields for this struct local_index = ColumnIndex(local_id.GetId(), global_id.GetChildIndexes()); } ``` As the fixme says, this probably shouldn't happen, because schema evolution could create a divide between the global and local fields for a given struct. #### `DynamicFilter` copying DynamicFilter works similar to a BoundParameterExpression in that it contains a shared_ptr to a shared structure to allow the filter to be updated. This can't really be copied and still function correctly. Perhaps at this stage the DynamicFilter has served its purpose and can be safely transformed into a non-dynamic filter? I did not feel confident in making this decision. #### `StructFilter` schema evolution The StructFilter besides a `child_idx` (which suffers from the issue mentioned above) also has a `child_name`. Through schema evolution the field could be renamed, this information should probably be added to the `MultiFileReaderColumnDefinition` to properly handle renames of struct fields. Or that information is already present, through the LogicalType attached to the column definition, but it needs to be used properly. #### `cast_map` casts The `cast_map` is still present, and necessary to be respected by the underlying reader, because delaying this cast would complicate the filtering logic. If we look at the following flow: `MFR <-> READER <-> FILE` Taking the parquet scanner as an example; Currently, the `cast_map` is applied in the translation from the `FILE` to the `READER`, through a CastColumnReader. This reader isn't anything special, it just applies a DefaultCastAs to the vector read from the reader it wraps, this can easily be replaced by a BoundCastExpression that is applied in the `FinalizeChunk`. The problem with moving this cast up, is that it would be applied in the translation from `READER` to the `MFR`. This is a problem because the filter pushdown is done inside the READER (after the cast is performed), and so the filters are using (and expecting) the "global" type, the target of the cast. To move the cast up into the translation from `READER` to the `MFR` the filters that are sent to the reader would need to be changed to work on the "local" type instead, which is not an easy refactor.

[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630) Revert "fix: drop useless python import" (duckdb/duckdb#16834)

Tishj added 23 commits March 6, 2025 10:54

first step: rework FinalizeChunk prototype, take in an input_chunk an…

1191032

…d an output_chunk

WIP, probably the CreateColumnMapping needs to play into this

bc55d0e

Merge remote-tracking branch 'upstream/main' into multi_file_reader_f…

43af0b5

…inalize_chunk

Merge branch 'multi_file_reader_mapping_demystification' into multi_f…

fe835c6

…ile_reader_finalize_chunk

this is mostly working?

baff4f4

continue if we find a virtual column

dfc4c95

stale references are bad. mkay. You shouldn't have stale references. …

a2c2979

…mkay. If you have stale references you are bad, because stale references are bad. mkay. It's a bad thing to have stale references, so don't be bad by having stale references. mkay. Don't be bad. mkay

putting the chunk inside the MultiFileReaderData is not granular enou…

59226cb

…gh, this could be shared by multiple threads, move it to the MultiFileLocalState instead

undo changes to 'column_mapping', these are needed for translating (l…

accd959

…ocal) col_idx -> global_idx, in the context of filters

don't use the column_mapping in these places

95d939b

fixing some issues

7b9e098

undo the changes to cast map for now, I imagine this will have unfore…

b231418

…seen filter/stats issues if we keep these changes

need to expect the (cast) type in our scan_chunk

6424912

provide the virtual column map to CreateMapping, so we can emit const…

c6c1210

…ants of the right type

undo the changes to the parquet_reader.cpp

15f4f7f

Merge branch 'multi_file_reader_mapping_demystification' into multi_f…

6b76b0a

…ile_reader_finalize_chunk

add missing expression for default value

b228e9e

Merge branch 'multi_file_reader_mapping_demystification' into multi_f…

8951187

…ile_reader_finalize_chunk

whoops, used the wrong index

5ac0011

no need to use MultiFileLocalIndex here

1d2a96c

Merge remote-tracking branch 'upstream/main' into multi_file_reader_f…

a54e3a5

…inalize_chunk

undo an early-return code cleanup

bc7971b

more undoing

01416b6

move the expression executor into the MultiFileLocalState

97cabe0

duckdb-draftbot marked this pull request as draft March 13, 2025 12:26

Tishj marked this pull request as ready for review March 13, 2025 13:48

Merge remote-tracking branch 'upstream/main' into multi_file_reader_f…

a55a352

…inalize_chunk

Tishj marked this pull request as ready for review March 21, 2025 07:55

pdet mentioned this pull request Mar 21, 2025

Basic Multi-File Reading Support paleolimbot/duckdb-nanoarrow#12

Closed

Mytherin reviewed Mar 21, 2025

View reviewed changes

src/include/duckdb/common/multi_file_reader_function.hpp Outdated Show resolved Hide resolved

Tishj added 2 commits March 21, 2025 23:54

initialize as part of TryInitializeNextBatch

2790714

Merge remote-tracking branch 'upstream/main' into multi_file_reader_f…

1a4f266

…inalize_chunk

Tishj marked this pull request as draft March 21, 2025 23:09

Tishj marked this pull request as ready for review March 21, 2025 23:09

duckdb-draftbot marked this pull request as draft March 21, 2025 23:10

Tishj added 2 commits March 22, 2025 16:44

remove outdated patch, add patch for chrono, building delta-kernel-rs…

8ca3536

… - this failed locally without it

nvm this doesnt work

b38db55

Mytherin marked this pull request as ready for review March 25, 2025 11:14

Merge remote-tracking branch 'upstream/main' into multi_file_reader_f…

90dba1d

…inalize_chunk

duckdb-draftbot marked this pull request as draft March 25, 2025 14:15

Tishj marked this pull request as ready for review March 25, 2025 14:17

this should really have no effect on the error at hand, but this was …

223ce89

…incorrectly changed by my PR or a previous PR of mine

duckdb-draftbot marked this pull request as draft March 25, 2025 17:32

Tishj marked this pull request as ready for review March 25, 2025 17:44

Mytherin merged commit ca2172d into duckdb:main Mar 26, 2025
48 of 49 checks passed

Tishj mentioned this pull request Mar 26, 2025

[MultiFileReader] Create "local" filters to hand to underlying readers #16838

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions #16630

[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions #16630

Uh oh!

Tishj commented Mar 13, 2025 •

edited

Loading

Uh oh!

Tishj commented Mar 13, 2025 •

edited

Loading

Uh oh!

Mytherin commented Mar 13, 2025

Uh oh!

Tishj commented Mar 13, 2025

Uh oh!

Mytherin commented Mar 13, 2025

Uh oh!

Mytherin left a comment

Uh oh!

Uh oh!

Tishj commented Mar 25, 2025 •

edited

Loading

Uh oh!

Tishj commented Mar 26, 2025

Uh oh!

Uh oh!

Mytherin commented Mar 26, 2025

Uh oh!

Uh oh!

[MultiFileReader] Rework MultiFileReader::FinalizeChunk to use Expressions #16630

[MultiFileReader] Rework MultiFileReader::FinalizeChunk to use Expressions #16630

Uh oh!

Conversation

Tishj commented Mar 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Tishj commented Mar 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Mytherin commented Mar 13, 2025

Uh oh!

Tishj commented Mar 13, 2025

Uh oh!

Mytherin commented Mar 13, 2025

Uh oh!

Mytherin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Tishj commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Tishj commented Mar 26, 2025

Uh oh!

Uh oh!

Mytherin commented Mar 26, 2025

Uh oh!

Uh oh!

[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions #16630

[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions #16630

Tishj commented Mar 13, 2025 •

edited

Loading

Tishj commented Mar 13, 2025 •

edited

Loading

Tishj commented Mar 25, 2025 •

edited

Loading