Skip to content

Conversation

Tishj
Copy link
Contributor

@Tishj Tishj commented Mar 13, 2025

Previously the underlying reader (such as csv or parquet) would read directly into a chunk that matches the "global" schema, containing columns that are perhaps constants for this file.
Because of this, the column_mapping had to be applied in the underlying reader to find the Vector that should be written into.

This PR replaces that, letting the underlying reader be unaware of the global schema, moving this mapping logic up to the FinalizeChunk method of the MultiFileReader implementation.

I was hoping to remove the need to expose the column_mapping to the underlying reader entirely, along with constant_map, cast_map and filter_map, but this is not possible without a bigger rework.

EDIT:
A follow up PR is in the works to make the underlying reader unaware of the global schema when it comes to filters

Tishj added 23 commits March 6, 2025 10:54
…mkay. If you have stale references you are bad, because stale references are bad. mkay. It's a bad thing to have stale references, so don't be bad by having stale references. mkay. Don't be bad. mkay
…gh, this could be shared by multiple threads, move it to the MultiFileLocalState instead
…ocal) col_idx -> global_idx, in the context of filters
…seen filter/stats issues if we keep these changes
@Tishj
Copy link
Contributor Author

Tishj commented Mar 13, 2025

@Mytherin we talked about the ExpressionExecutor construction/initialization being wasteful inside FinalizeChunk but I don't think it's possible to fix this, the ExpressionExecutor is not safe to share, because its state is mutable.
There could be multiple threads executing FinalizeChunk on chunks that originates from the same source file at the same time, so this can't be placed in the MultiFileReaderData.

Perhaps it can be moved into the MultiFileScan body, so the FinalizeChunk method requires a ExpressionExecutor & to be passed in?

@Mytherin
Copy link
Collaborator

It can't be shared indeed - it needs to be in MultiFileLocalState as part of the thread-local table function state. It can be initialized in TryInitializeNextBatch and passed into FinalizeChunk.

@Tishj
Copy link
Contributor Author

Tishj commented Mar 13, 2025

It can't be shared indeed - it needs to be in MultiFileLocalState as part of the thread-local table function state. It can be initialized in TryInitializeNextBatch and passed into FinalizeChunk.

That makes sense, so the construction of the ExpressionExecutor is what we're most concerned with, not necessarily the AddExpression calls?
(while also limiting the add expression calls to only be done once per file handled by the thread)

So we don't attach the expression executor to the lifetime of the file, we attach it to that of the thread (local) state?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 13, 2025 12:26
@Mytherin
Copy link
Collaborator

No the AddExpression calls are the most expensive since that sets up all of the expression states/buffers - but once per file is fine.

@Tishj Tishj marked this pull request as ready for review March 13, 2025 13:48
@Tishj Tishj marked this pull request as ready for review March 21, 2025 07:55
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good - one more comment then this is good to go

@Tishj Tishj marked this pull request as draft March 21, 2025 23:09
@Tishj Tishj marked this pull request as ready for review March 21, 2025 23:09
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 21, 2025 23:10
@Mytherin Mytherin marked this pull request as ready for review March 25, 2025 11:14
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 25, 2025 14:15
@Tishj Tishj marked this pull request as ready for review March 25, 2025 14:17
@Tishj
Copy link
Contributor Author

Tishj commented Mar 25, 2025

Run ./build/release/duckdb -c "PRAGMA platform;"
┌──────────────────┐
│     platform     │
│     varchar      │
├──────────────────┤
│ linux_amd64_gcc4 │
└──────────────────┘
2025-03-25T16:13:49.1845660Z [1/1] (100%): test/sql/copy/csv/afl/fuzz_20250211_crash.test                    
2025-03-25T16:13:49.1846009Z ===============================================================================
2025-03-25T16:13:49.1846273Z test cases: 1 | 1 failed
2025-03-25T16:13:49.1846492Z assertions: 2 | 1 passed | 1 failed
2025-03-25T16:13:49.1846665Z 
2025-03-25T16:13:49.1846669Z 
2025-03-25T16:13:49.1846739Z --------------------
2025-03-25T16:13:49.1847131Z STDERR
2025-03-25T16:13:49.1847304Z --------------------
2025-03-25T16:13:49.1847514Z ================================================================================
2025-03-25T16:13:49.1847927Z Query failed with internal exception! (test/sql/copy/csv/afl/fuzz_20250211_crash.test:8)!
2025-03-25T16:13:49.1848343Z ================================================================================
2025-03-25T16:13:49.1848726Z FROM read_csv('data/csv/afl/20250211_csv_fuzz_crash/case_53.csv', buffer_size=42);
2025-03-25T16:13:49.1849080Z Actual result:
2025-03-25T16:13:49.1849277Z ================================================================================
2025-03-25T16:13:49.1849623Z INTERNAL Error: Attempted to access index 3 within vector of size 3
2025-03-25T16:13:49.1849876Z 
2025-03-25T16:13:49.1849954Z Stack Trace:
2025-03-25T16:13:49.1850061Z 
2025-03-25T16:13:49.1850582Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::Exception::ToJSON(duckdb::ExceptionType, std::string const&)+0x53) [0x7efd101e0883]
2025-03-25T16:13:49.1851645Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::Exception::Exception(duckdb::ExceptionType, std::string const&)+0x16) [0x7efd101e08b6]
2025-03-25T16:13:49.1852888Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::InternalException::InternalException(std::string const&)+0x11) [0x7efd101e2f01]
2025-03-25T16:13:49.1853959Z build/release/test/unittest(duckdb::InternalException::InternalException<unsigned long, unsigned long>(std::string const&, unsigned long, unsigned long)+0x137) [0x67da17]
2025-03-25T16:13:49.1854794Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(+0x975282) [0x7efd0f775282]
2025-03-25T16:13:49.1855409Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(+0x15ceefe) [0x7efd103ceefe]
2025-03-25T16:13:49.1856203Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::StringValueScanner::Flush(duckdb::DataChunk&)+0xb1) [0x7efd103cf191]
2025-03-25T16:13:49.1859526Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::MultiFileReaderFunction<duckdb::CSVMultiFileInfo>::MultiFileScan(duckdb::ClientContext&, duckdb::TableFunctionInput&, duckdb::DataChunk&)+0xae) [0x7efd0fe86b8e]
2025-03-25T16:13:49.1862116Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::PhysicalTableScan::GetData(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSourceInput&) const+0x52) [0x7efd104aa7c2]
2025-03-25T16:13:49.1864253Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::PipelineExecutor::FetchFromSource(duckdb::DataChunk&)+0x78) [0x7efd1063df48]
2025-03-25T16:13:49.1865457Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::PipelineExecutor::Execute(unsigned long)+0x12b) [0x7efd1064929b]
2025-03-25T16:13:49.1866643Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode)+0xe4) [0x7efd10649524]
2025-03-25T16:13:49.1867899Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::ExecutorTask::Execute(duckdb::TaskExecutionMode)+0xce) [0x7efd1063fb9e]
2025-03-25T16:13:49.1868911Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(duckdb::TaskScheduler::ExecuteForever(std::atomic<bool>*)+0x127) [0x7efd10647fb7]
2025-03-25T16:13:49.1869711Z /home/runner/work/duckdb/duckdb/build/release/src/libduckdb.so(+0x2455310) [0x7efd11255310]
2025-03-25T16:13:49.1870201Z /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7efd0e69caa4]
2025-03-25T16:13:49.1870550Z /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7efd0e729c3c]
2025-03-25T16:13:49.1870762Z 
2025-03-25T16:13:49.1871141Z This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
2025-03-25T16:13:49.1871788Z For more information, see https://duckdb.org/docs/dev/internal_errors

This is the same issue as https://github.com/duckdblabs/duckdb-internal/issues/4499

…incorrectly changed by my PR or a previous PR of mine
@duckdb-draftbot duckdb-draftbot marked this pull request as draft March 25, 2025 17:32
@Tishj Tishj marked this pull request as ready for review March 25, 2025 17:44
@Tishj
Copy link
Contributor Author

Tishj commented Mar 26, 2025

Now its an http failure @Mytherin

@Mytherin Mytherin merged commit ca2172d into duckdb:main Mar 26, 2025
48 of 49 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

Mytherin added a commit that referenced this pull request Mar 27, 2025
#16838)

This PR is a follow up to #16630 

The aim of this PR is to not require the underlying reader to make use
of complicated mappings created in the `MultiFileReaderData` to
correctly handle filters pushed into the reader.

Previously the `TableFilterSet` handed to the underlying reader
contained "global" filters, i.e the `column_ids` referenced the "global"
schema, which is not guaranteed to match the schema of the file that the
reader is reading.

To convert these into the "local" schema (that of the file that is being
read), the `filter_map` had to be used, then the filter entry had to be
checked to see if the filter points to a column that is present in the
file being read, if it's not then the filter points to the entry in the
`constant_map` that should be used instead.

----------------------------------

After this PR, the underlying reader is handed "local" filters, that
already contain `column_ids` that reference the "local" schema.
No filter will be found that references columns that are not present in
the file, as they have been evaluated beforehand.

These filters (the ones targeting columns that are not in the file) can
be safely removed because if the filter doesn't match we skip the file
entirely, it does not have to be scanned. If the filter *does* match, it
can be safely ignored by the reader.

----------------------------------

### Open questions / Areas for future work discovered

#### Copied `child_indexes` issue
The `child_indexes` of the global `ColumnIndex` are copied 1-to-1 in the
creation of the mapping still:
```c++
			//! FIXME: local fields are not guaranteed to match with the global fields for this struct
			local_index = ColumnIndex(local_id.GetId(), global_id.GetChildIndexes());
		}
```
As the fixme says, this probably shouldn't happen, because schema
evolution could create a divide between the global and local fields for
a given struct.

#### `DynamicFilter` copying
DynamicFilter works similar to a BoundParameterExpression in that it
contains a shared_ptr to a shared structure to allow the filter to be
updated. This can't really be copied and still function correctly.
Perhaps at this stage the DynamicFilter has served its purpose and can
be safely transformed into a non-dynamic filter?
I did not feel confident in making this decision.

#### `StructFilter` schema evolution
The StructFilter besides a `child_idx` (which suffers from the issue
mentioned above) also has a `child_name`.
Through schema evolution the field could be renamed, this information
should probably be added to the `MultiFileReaderColumnDefinition` to
properly handle renames of struct fields.
Or that information is already present, through the LogicalType attached
to the column definition, but it needs to be used properly.

#### `cast_map` casts
The `cast_map` is still present, and necessary to be respected by the
underlying reader, because delaying this cast would complicate the
filtering logic.

If we look at the following flow:
`MFR <-> READER <-> FILE`

Taking the parquet scanner as an example;
Currently, the `cast_map` is applied in the translation from the `FILE`
to the `READER`, through a CastColumnReader.
This reader isn't anything special, it just applies a DefaultCastAs to
the vector read from the reader it wraps, this can easily be replaced by
a BoundCastExpression that is applied in the `FinalizeChunk`.

The problem with moving this cast up, is that it would be applied in the
translation from `READER` to the `MFR`.
This is a problem because the filter pushdown is done inside the READER
(after the cast is performed), and so the filters are using (and
expecting) the "global" type, the target of the cast.

To move the cast up into the translation from `READER` to the `MFR` the
filters that are sent to the reader would need to be changed to work on
the "local" type instead, which is not an easy refactor.
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630)
Revert "fix: drop useless python import" (duckdb/duckdb#16834)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630)
Revert "fix: drop useless python import" (duckdb/duckdb#16834)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630)
Revert "fix: drop useless python import" (duckdb/duckdb#16834)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630)
Revert "fix: drop useless python import" (duckdb/duckdb#16834)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 16, 2025
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630)
Revert "fix: drop useless python import" (duckdb/duckdb#16834)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
[MultiFileReader] Rework `MultiFileReader::FinalizeChunk` to use Expressions (duckdb/duckdb#16630)
Revert "fix: drop useless python import" (duckdb/duckdb#16834)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants