Conversation

Mytherin
Collaborator

No description provided.

ywelsch and others added 30 commits February 12, 2025 13:15
We had two users crash with the following backtrace:

```
    frame #0: 0x0000ffffab2571ec
    frame #1: 0x0000aaaaac00c5fc duckling`duckdb::InternalException::InternalException(this=<unavailable>, msg=<unavailable>) at exception.cpp:328:2
    frame #2: 0x0000aaaaac1ee418 duckling`duckdb::optional_ptr<duckdb::OptimisticDataWriter, true>::CheckValid(this=<unavailable>) const at optional_ptr.hpp:34:11
    frame #3: 0x0000aaaaac1eea8c duckling`duckdb::MergeCollectionTask::Execute(duckdb::PhysicalBatchInsert const&, duckdb::ClientContext&, duckdb::GlobalSinkState&, duckdb::LocalSinkState&) [inlined] duckdb::optional_ptr<duckdb::OptimisticDataWriter, true>::operator*(this=<unavailable>) at optional_ptr.hpp:43:3
    frame #4: 0x0000aaaaac1eea84 duckling`duckdb::MergeCollectionTask::Execute(this=0x0000aaaaf1b06150, op=<unavailable>, context=0x0000aaaba820d8d0, gstate_p=0x0000aaab06880f00, lstate_p=<unavailable>) at physical_batch_insert.cpp:219:90
    frame #5: 0x0000aaaaac1d2e10 duckling`duckdb::PhysicalBatchInsert::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const [inlined] duckdb::PhysicalBatchInsert::ExecuteTask(this=0x0000aaaafa62ab40, context=<unavailable>, gstate_p=0x0000aaab06880f00, lstate_p=0x0000aab12d442960) const at physical_batch_insert.cpp:425:8
    frame #6: 0x0000aaaaac1d2dd8 duckling`duckdb::PhysicalBatchInsert::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const [inlined] duckdb::PhysicalBatchInsert::ExecuteTasks(this=0x0000aaaafa62ab40, context=<unavailable>, gstate_p=0x0000aaab06880f00, lstate_p=0x0000aab12d442960) const at physical_batch_insert.cpp:431:9
    frame #7: 0x0000aaaaac1d2dd8 duckling`duckdb::PhysicalBatchInsert::Sink(this=0x0000aaaafa62ab40, context=0x0000aab2fffd7cb0, chunk=<unavailable>, input=<unavailable>) const at physical_batch_insert.cpp:494:4
    frame #8: 0x0000aaaaac353158 duckling`duckdb::PipelineExecutor::ExecutePushInternal(duckdb::DataChunk&, duckdb::ExecutionBudget&, unsigned long) [inlined] duckdb::PipelineExecutor::Sink(this=0x0000aab2fffd7c00, chunk=0x0000aab2fffd7d30, input=0x0000fffec0aba8d8) at pipeline_executor.cpp:521:24
    frame #9: 0x0000aaaaac353130 duckling`duckdb::PipelineExecutor::ExecutePushInternal(this=0x0000aab2fffd7c00, input=0x0000aab2fffd7d30, chunk_budget=0x0000fffec0aba980, initial_idx=0) at pipeline_executor.cpp:332:23
    frame #10: 0x0000aaaaac34f7b4 duckling`duckdb::PipelineExecutor::Execute(this=0x0000aab2fffd7c00, max_chunks=<unavailable>) at pipeline_executor.cpp:201:13
    frame #11: 0x0000aaaaac34f258 duckling`duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode) [inlined] duckdb::PipelineExecutor::Execute(this=<unavailable>) at pipeline_executor.cpp:278:9
    frame #12: 0x0000aaaaac34f250 duckling`duckdb::PipelineTask::ExecuteTask(this=0x0000aab16dafd630, mode=<unavailable>) at pipeline.cpp:51:33
    frame #13: 0x0000aaaaac348298 duckling`duckdb::ExecutorTask::Execute(this=0x0000aab16dafd630, mode=<unavailable>) at executor_task.cpp:49:11
    frame #14: 0x0000aaaaac356600 duckling`duckdb::TaskScheduler::ExecuteForever(this=0x0000aaaaf0105560, marker=0x0000aaaaf00ee578) at task_scheduler.cpp:189:32
    frame #15: 0x0000ffffab0a31fc
    frame #16: 0x0000ffffab2ad5c8
```

Core dump analysis showed that the assertion `D_ASSERT(lstate.writer);`
in `MergeCollectionTask::Execute` was not satisfied (i.e. it crashed
because `lstate.writer` was a null pointer) while
`PhysicalBatchInsert::Sink` was processing merge tasks from (other)
pipeline executors.

My suspicion is that this only happens under heavily concurrent
workloads (which applies to the two users who crashed). The patch
submitted as part of this PR has addressed the issue for these users.
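The crash surfaces through DuckDB's `optional_ptr` wrapper, whose `CheckValid` throws when the wrapped pointer is unset (frames #2 and #3 in the backtrace). A minimal sketch of the pattern, simplified and not DuckDB's actual implementation:

```cpp
#include <stdexcept>

// Simplified sketch of an optional_ptr-style wrapper: dereferencing a null
// pointer raises an exception instead of invoking undefined behavior.
template <class T>
class optional_ptr {
public:
	optional_ptr() : ptr(nullptr) {
	}
	explicit optional_ptr(T *ptr_p) : ptr(ptr_p) {
	}

	void CheckValid() const {
		if (!ptr) {
			throw std::runtime_error("Attempting to dereference an optional pointer that is not set");
		}
	}
	T &operator*() const {
		CheckValid();
		return *ptr;
	}
	explicit operator bool() const {
		return ptr != nullptr;
	}

private:
	T *ptr;
};
```

With this pattern, a task that dereferences an unset `lstate.writer` fails with a clear exception rather than a segfault, which is exactly what the backtrace shows.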
Having a non-deleted copy constructor allows writing the following
code, which crashes:

```cpp
#include "duckdb.hpp"
using namespace duckdb;

DuckDB duckdb(nullptr);
Connection con(duckdb);
auto pending = con.PendingQuery("FROM range(1000)");
// Copying the pending query result leaves the copy in an invalid state
auto pending2 = *pending;

pending2.Execute(); // Crashes -> Attempting to execute an unsuccessful or closed pending query result
```
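A natural fix is to delete the copy constructor so the misuse above becomes a compile-time error instead of a runtime crash. A hedged sketch of the idiom, using an illustrative type name rather than DuckDB's actual class:

```cpp
#include <type_traits>

// Illustrative sketch: a result-style type that owns internal state should be
// movable but not copyable, so `auto copy = *ptr;` fails to compile.
struct PendingResult {
	PendingResult() = default;
	PendingResult(const PendingResult &) = delete;            // no copying
	PendingResult &operator=(const PendingResult &) = delete; // no copy-assign
	PendingResult(PendingResult &&) = default;                // moving is fine
	PendingResult &operator=(PendingResult &&) = default;
};
```

Move-only semantics keep the usual ownership-transfer patterns working while ruling out the accidental copy shown above.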
* Use two cursors for range searches
* Reduces benchmark time from 35s to 25s
…b#16223)

fixes duckdblabs/duckdb-internal#4203
fixes duckdb#16175

For smaller data sizes, a parallel Bernoulli sample could produce
skewed results because the same seed was used for all threads. For
example, with 10 threads and a data size of 1000 rows, every thread
gets ~100 rows. With a sample size of 1%, it is possible that the
random engine does not produce a value < 0.01 within the first 100
randomly generated values; since all threads draw the same sequence,
none of the threads returns a row for the result.

The fix is to assign every thread a random seed. If repeatable is
set, we instead set `ParallelSink` to false and use the provided seed,
guaranteeing a repeatable result.

This PR is similar to how reservoir sampling currently behaves.
Related PR: https://github.com/duckdb/duckdb/pull/14797/files
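The effect can be sketched with a toy Bernoulli sampler (a standalone illustration with hypothetical names, not DuckDB's sampler): with one shared seed, every thread draws the identical random sequence, so if one thread's sample is empty they all are; deriving a distinct seed per thread removes the correlation.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Toy Bernoulli sampler: count how many of `rows` values a thread keeps
// at the given sampling rate, using the given seed.
static size_t BernoulliCount(uint64_t seed, size_t rows, double rate) {
	std::mt19937_64 engine(seed);
	std::uniform_real_distribution<double> dist(0.0, 1.0);
	size_t kept = 0;
	for (size_t i = 0; i < rows; i++) {
		if (dist(engine) < rate) {
			kept++;
		}
	}
	return kept;
}

// With a shared seed, every "thread" produces the same count; if that count is
// zero, the whole parallel sample is empty. Per-thread seeds decorrelate the
// threads, while fixed seeds still keep the result repeatable.
static std::vector<size_t> SampleCounts(bool per_thread_seed, uint64_t base_seed, size_t threads,
                                        size_t rows_per_thread, double rate) {
	std::vector<size_t> counts;
	for (size_t t = 0; t < threads; t++) {
		uint64_t seed = per_thread_seed ? base_seed + t : base_seed;
		counts.push_back(BernoulliCount(seed, rows_per_thread, rate));
	}
	return counts;
}
```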
pdet and others added 12 commits February 17, 2025 15:25
…er (duckdb#16263)

Currently we keep the compressed buffer around, even though we only
need it for a short time (while reading data before decompressing it).
Since we keep a compressed buffer per column, these resizable buffers
can grow to a rather substantial amount of memory, which can lead to
50-100% extra memory use when reading Parquet files. Destroying and
re-allocating the buffer does not seem to greatly affect performance -
below are ClickBench timings run on my laptop on a single big file
(which is when caching matters most, given that caches are
re-instantiated between files anyway):

We can perhaps do something more clever where we keep around a single
cache that we share across readers, but not for the bug-fix release.

| Query |  Old  |  New  | Ratio |
|-------|------:|------:|------:|
| Q00   | 0.165 | 0.156 | 0.95  |
| Q01   | 0.156 | 0.155 | 0.99  |
| Q02   | 0.192 | 0.191 | 0.99  |
| Q03   | 0.179 | 0.188 | 1.05  |
| Q04   | 0.292 | 0.287 | 0.98  |
| Q05   | 0.364 | 0.349 | 0.96  |
| Q06   | 0.154 | 0.157 | 1.02  |
| Q07   | 0.158 | 0.159 | 1.01  |
| Q08   | 0.361 | 0.348 | 0.96  |
| Q09   | 0.449 | 0.440 | 0.98  |
| Q10   | 0.225 | 0.220 | 0.98  |
| Q11   | 0.251 | 0.249 | 0.99  |
| Q12   | 0.379 | 0.355 | 0.94  |
| Q13   | 0.552 | 0.539 | 0.98  |
| Q14   | 0.426 | 0.412 | 0.97  |
| Q15   | 0.351 | 0.325 | 0.93  |
| Q16   | 0.665 | 0.629 | 0.95  |
| Q17   | 0.617 | 0.596 | 0.97  |
| Q18   | 0.971 | 0.947 | 0.98  |
| Q19   | 0.215 | 0.196 | 0.91  |
| Q20   | 0.929 | 0.892 | 0.96  |
| Q21   | 0.730 | 0.719 | 0.98  |
| Q22   | 1.213 | 1.226 | 1.01  |
| Q23   | 3.062 | 2.905 | 0.95  |
| Q24   | 0.189 | 0.189 | 1.00  |
| Q25   | 0.178 | 0.172 | 0.97  |
| Q26   | 0.201 | 0.186 | 0.93  |
| Q27   | 0.771 | 0.729 | 0.95  |
| Q28   | 7.475 | 7.246 | 0.97  |
| Q29   | 0.254 | 0.236 | 0.93  |
| Q30   | 0.483 | 0.477 | 0.99  |
| Q31   | 0.508 | 0.549 | 1.08  |
| Q32   | 1.168 | 1.254 | 1.07  |
| Q33   | 1.431 | 1.493 | 1.04  |
| Q34   | 1.453 | 1.511 | 1.04  |
| Q35   | 0.359 | 0.369 | 1.03  |
| Q36   | 0.165 | 0.184 | 1.12  |
| Q37   | 0.161 | 0.166 | 1.03  |
| Q38   | 0.173 | 0.172 | 0.99  |
| Q39   | 0.177 | 0.182 | 1.03  |
| Q40   | 0.165 | 0.169 | 1.02  |
| Q41   | 0.162 | 0.172 | 1.06  |
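The change above amounts to releasing the scratch buffer as soon as a page has been decompressed, rather than caching it per column. A minimal sketch of the two lifetimes, using hypothetical names and a simple resizable buffer (not the actual Parquet reader code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical resizable scratch buffer that holds compressed bytes
// before decompression.
struct ResizableBuffer {
	std::vector<char> data;

	void Resize(size_t size) {
		if (size > data.size()) {
			data.resize(size);
		}
	}
	void Destroy() {
		data.clear();
		data.shrink_to_fit(); // actually release the memory
	}
};

// Old behavior: the buffer lives as long as the column reader, so its size
// stays at the high-water mark across pages. New behavior: destroy it after
// each use and re-allocate next time, trading a cheap allocation for lower
// resident memory.
static void ReadPage(ResizableBuffer &scratch, size_t compressed_size, bool release_after_use) {
	scratch.Resize(compressed_size);
	// ... read compressed bytes into scratch and decompress them here ...
	if (release_after_use) {
		scratch.Destroy();
	}
}
```

The follow-up idea mentioned above (a single cache shared across readers) would keep one high-water-mark buffer instead of one per column.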
…nemap (duckdb#16269)

During row group checks - we keep track of which filters are always true
(and hence do not need to be checked during the actual scan). Due to a
missing `else` statement, we could call this method twice for the same
filter.

Now this doesn't look problematic, but in `SetFilterAlwaysTrue` we
*also* keep track of how many filters have been set as being always true
- and if all filters are set as being always true, the filters are
skipped entirely.

This can cause problems when optional filters (as e.g. constructed due
to a Top-N) are combined with non-optional filters (as e.g. constructed
through a `WHERE` clause). The filter would not be correctly executed -
leading to incorrect rows sneaking into the result.
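The bookkeeping bug can be sketched as follows (a simplified, hypothetical version of the check, not DuckDB's actual code): marking the same filter always-true twice inflates the counter, so the scan can wrongly conclude that all filters are skippable.

```cpp
#include <cstddef>
#include <vector>

// Simplified sketch of per-row-group filter bookkeeping: once every filter is
// known to be always true, the scan skips filter execution entirely.
struct FilterState {
	size_t filter_count;
	size_t always_true_count = 0;
	std::vector<bool> always_true;

	explicit FilterState(size_t count) : filter_count(count), always_true(count, false) {
	}

	void SetFilterAlwaysTrue(size_t filter_idx) {
		// The missing `else` in the original code let this method run twice for
		// the same filter; guarding on the flag keeps the count correct.
		if (!always_true[filter_idx]) {
			always_true[filter_idx] = true;
			always_true_count++;
		}
	}
	bool CanSkipAllFilters() const {
		return always_true_count == filter_count;
	}
};
```

Without the guard, two calls for filter 0 would bring the count to 2 of 2 and silently drop the `WHERE` filter from the scan.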
* Check hints for equality and skip search.

fixes: duckdb#16250
fixes: duckdblabs/duckdb-internal#4229
Fix duckdb#16257

I've attached the diff that needs to happen to `main` when we eventually
need to merge this branch into it, as the Parquet writer has been
refactored significantly in the meantime.


[page_fix.txt](https://github.com/user-attachments/files/18827335/page_fix.txt)
Basically cherry-picking the commits of
duckdb#16154 to the `v1.2` branch, and
reducing the number of tests to a minimal set.
@Mytherin Mytherin merged commit 138639a into duckdb:main Feb 18, 2025
47 checks passed
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Mar 4, 2025
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
@Mytherin Mytherin deleted the mergeintomain branch April 2, 2025 09:24
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025