Conversation

Mytherin
Collaborator

No description provided.

ywelsch and others added 30 commits February 12, 2025 13:15
We had two users crash with the following backtrace:

```
    frame #0: 0x0000ffffab2571ec
    frame #1: 0x0000aaaaac00c5fc duckling`duckdb::InternalException::InternalException(this=<unavailable>, msg=<unavailable>) at exception.cpp:328:2
    frame #2: 0x0000aaaaac1ee418 duckling`duckdb::optional_ptr<duckdb::OptimisticDataWriter, true>::CheckValid(this=<unavailable>) const at optional_ptr.hpp:34:11
    frame #3: 0x0000aaaaac1eea8c duckling`duckdb::MergeCollectionTask::Execute(duckdb::PhysicalBatchInsert const&, duckdb::ClientContext&, duckdb::GlobalSinkState&, duckdb::LocalSinkState&) [inlined] duckdb::optional_ptr<duckdb::OptimisticDataWriter, true>::operator*(this=<unavailable>) at optional_ptr.hpp:43:3
    frame #4: 0x0000aaaaac1eea84 duckling`duckdb::MergeCollectionTask::Execute(this=0x0000aaaaf1b06150, op=<unavailable>, context=0x0000aaaba820d8d0, gstate_p=0x0000aaab06880f00, lstate_p=<unavailable>) at physical_batch_insert.cpp:219:90
    frame #5: 0x0000aaaaac1d2e10 duckling`duckdb::PhysicalBatchInsert::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const [inlined] duckdb::PhysicalBatchInsert::ExecuteTask(this=0x0000aaaafa62ab40, context=<unavailable>, gstate_p=0x0000aaab06880f00, lstate_p=0x0000aab12d442960) const at physical_batch_insert.cpp:425:8
    frame #6: 0x0000aaaaac1d2dd8 duckling`duckdb::PhysicalBatchInsert::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const [inlined] duckdb::PhysicalBatchInsert::ExecuteTasks(this=0x0000aaaafa62ab40, context=<unavailable>, gstate_p=0x0000aaab06880f00, lstate_p=0x0000aab12d442960) const at physical_batch_insert.cpp:431:9
    frame #7: 0x0000aaaaac1d2dd8 duckling`duckdb::PhysicalBatchInsert::Sink(this=0x0000aaaafa62ab40, context=0x0000aab2fffd7cb0, chunk=<unavailable>, input=<unavailable>) const at physical_batch_insert.cpp:494:4
    frame #8: 0x0000aaaaac353158 duckling`duckdb::PipelineExecutor::ExecutePushInternal(duckdb::DataChunk&, duckdb::ExecutionBudget&, unsigned long) [inlined] duckdb::PipelineExecutor::Sink(this=0x0000aab2fffd7c00, chunk=0x0000aab2fffd7d30, input=0x0000fffec0aba8d8) at pipeline_executor.cpp:521:24
    frame #9: 0x0000aaaaac353130 duckling`duckdb::PipelineExecutor::ExecutePushInternal(this=0x0000aab2fffd7c00, input=0x0000aab2fffd7d30, chunk_budget=0x0000fffec0aba980, initial_idx=0) at pipeline_executor.cpp:332:23
    frame #10: 0x0000aaaaac34f7b4 duckling`duckdb::PipelineExecutor::Execute(this=0x0000aab2fffd7c00, max_chunks=<unavailable>) at pipeline_executor.cpp:201:13
    frame #11: 0x0000aaaaac34f258 duckling`duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode) [inlined] duckdb::PipelineExecutor::Execute(this=<unavailable>) at pipeline_executor.cpp:278:9
    frame #12: 0x0000aaaaac34f250 duckling`duckdb::PipelineTask::ExecuteTask(this=0x0000aab16dafd630, mode=<unavailable>) at pipeline.cpp:51:33
    frame #13: 0x0000aaaaac348298 duckling`duckdb::ExecutorTask::Execute(this=0x0000aab16dafd630, mode=<unavailable>) at executor_task.cpp:49:11
    frame #14: 0x0000aaaaac356600 duckling`duckdb::TaskScheduler::ExecuteForever(this=0x0000aaaaf0105560, marker=0x0000aaaaf00ee578) at task_scheduler.cpp:189:32
    frame #15: 0x0000ffffab0a31fc
    frame #16: 0x0000ffffab2ad5c8
```

Core dump analysis showed that the assertion `D_ASSERT(lstate.writer);`
in `MergeCollectionTask::Execute` was not satisfied (i.e. it crashed
because `lstate.writer` was a null pointer) while
`PhysicalBatchInsert::Sink` was processing merge tasks from (other)
pipeline executors.

My suspicion is that this only happens under heavily concurrent
workloads (which applies to the two users who crashed). The patch
submitted as part of this PR has addressed the issue for these users.
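The crash surfaces through DuckDB's `optional_ptr` wrapper, whose `CheckValid` throws when the wrapped pointer is unset (frames #2 and #3 in the backtrace). A minimal sketch of the pattern, simplified and not DuckDB's actual implementation:

```cpp
#include <stdexcept>

// Simplified sketch of an optional_ptr-style wrapper: dereferencing a null
// pointer raises an exception instead of invoking undefined behavior.
template <class T>
class optional_ptr {
public:
	optional_ptr() : ptr(nullptr) {
	}
	explicit optional_ptr(T *ptr_p) : ptr(ptr_p) {
	}

	void CheckValid() const {
		if (!ptr) {
			throw std::runtime_error("Attempting to dereference an optional pointer that is not set");
		}
	}
	T &operator*() const {
		CheckValid();
		return *ptr;
	}
	explicit operator bool() const {
		return ptr != nullptr;
	}

private:
	T *ptr;
};
```

With this pattern, a task that dereferences an unset `lstate.writer` fails with a clear exception rather than a segfault, which is exactly what the backtrace shows.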
Having a non-deleted copy constructor allows writing the following
code, which crashes:

```cpp
#include "duckdb.hpp"
using namespace duckdb;

DuckDB duckdb(nullptr);
Connection con(duckdb);
auto pending = con.PendingQuery("FROM range(1000)");
// Copying the pending query result leaves the copy in an invalid state
auto pending2 = *pending;

pending2.Execute(); // Crashes -> Attempting to execute an unsuccessful or closed pending query result
```
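A natural fix is to delete the copy constructor so the misuse above becomes a compile-time error instead of a runtime crash. A hedged sketch of the idiom, using an illustrative type name rather than DuckDB's actual class:

```cpp
#include <type_traits>

// Illustrative sketch: a result-style type that owns internal state should be
// movable but not copyable, so `auto copy = *ptr;` fails to compile.
struct PendingResult {
	PendingResult() = default;
	PendingResult(const PendingResult &) = delete;            // no copying
	PendingResult &operator=(const PendingResult &) = delete; // no copy-assign
	PendingResult(PendingResult &&) = default;                // moving is fine
	PendingResult &operator=(PendingResult &&) = default;
};
```

Move-only semantics keep the usual ownership-transfer patterns working while ruling out the accidental copy shown above.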
* Use two cursors for range searches
* Reduces benchmark time from 35s to 25s
…b#16223)

fixes duckdblabs/duckdb-internal#4203
fixes duckdb#16175

For smaller data sizes, a parallel Bernoulli sample could produce
skewed results because the same seed was used for all threads. For
example, with 10 threads and a data size of 1000 rows, every thread
gets ~100 rows. With a sample size of 1%, it is possible that the
random engine does not produce a value < 0.01 within the first 100
randomly generated values; since all threads draw the same sequence,
none of the threads returns a row for the result.

The fix is to assign every thread a random seed. If repeatable is
set, we instead set `ParallelSink` to false and use the provided seed,
guaranteeing a repeatable result.

This PR is similar to how reservoir sampling currently behaves.
Related PR: https://github.com/duckdb/duckdb/pull/14797/files
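The effect can be sketched with a toy Bernoulli sampler (a standalone illustration with hypothetical names, not DuckDB's sampler): with one shared seed, every thread draws the identical random sequence, so if one thread's sample is empty they all are; deriving a distinct seed per thread removes the correlation.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Toy Bernoulli sampler: count how many of `rows` values a thread keeps
// at the given sampling rate, using the given seed.
static size_t BernoulliCount(uint64_t seed, size_t rows, double rate) {
	std::mt19937_64 engine(seed);
	std::uniform_real_distribution<double> dist(0.0, 1.0);
	size_t kept = 0;
	for (size_t i = 0; i < rows; i++) {
		if (dist(engine) < rate) {
			kept++;
		}
	}
	return kept;
}

// With a shared seed, every "thread" produces the same count; if that count is
// zero, the whole parallel sample is empty. Per-thread seeds decorrelate the
// threads, while fixed seeds still keep the result repeatable.
static std::vector<size_t> SampleCounts(bool per_thread_seed, uint64_t base_seed, size_t threads,
                                        size_t rows_per_thread, double rate) {
	std::vector<size_t> counts;
	for (size_t t = 0; t < threads; t++) {
		uint64_t seed = per_thread_seed ? base_seed + t : base_seed;
		counts.push_back(BernoulliCount(seed, rows_per_thread, rate));
	}
	return counts;
}
```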
pdet and others added 12 commits February 17, 2025 15:25
…er (duckdb#16263)

Currently we keep the compressed buffer around, even though we only
need it for a short time (while reading data before decompressing it).
Since we keep a compressed buffer per column, these resizable buffers
can grow to a rather substantial amount of memory, which can lead to
50-100% extra memory use when reading Parquet files. Destroying and
re-allocating the buffer does not seem to greatly affect performance -
below are ClickBench timings run on my laptop on a single big file
(which is when caching matters most, given that caches are
re-instantiated between files anyway):

We can perhaps do something more clever where we keep around a single
cache that we share across readers, but not for the bug-fix release.

| Query |  Old  |  New  | Ratio |
|-------|------:|------:|------:|
| Q00   | 0.165 | 0.156 | 0.95  |
| Q01   | 0.156 | 0.155 | 0.99  |
| Q02   | 0.192 | 0.191 | 0.99  |
| Q03   | 0.179 | 0.188 | 1.05  |
| Q04   | 0.292 | 0.287 | 0.98  |
| Q05   | 0.364 | 0.349 | 0.96  |
| Q06   | 0.154 | 0.157 | 1.02  |
| Q07   | 0.158 | 0.159 | 1.01  |
| Q08   | 0.361 | 0.348 | 0.96  |
| Q09   | 0.449 | 0.440 | 0.98  |
| Q10   | 0.225 | 0.220 | 0.98  |
| Q11   | 0.251 | 0.249 | 0.99  |
| Q12   | 0.379 | 0.355 | 0.94  |
| Q13   | 0.552 | 0.539 | 0.98  |
| Q14   | 0.426 | 0.412 | 0.97  |
| Q15   | 0.351 | 0.325 | 0.93  |
| Q16   | 0.665 | 0.629 | 0.95  |
| Q17   | 0.617 | 0.596 | 0.97  |
| Q18   | 0.971 | 0.947 | 0.98  |
| Q19   | 0.215 | 0.196 | 0.91  |
| Q20   | 0.929 | 0.892 | 0.96  |
| Q21   | 0.730 | 0.719 | 0.98  |
| Q22   | 1.213 | 1.226 | 1.01  |
| Q23   | 3.062 | 2.905 | 0.95  |
| Q24   | 0.189 | 0.189 | 1.00  |
| Q25   | 0.178 | 0.172 | 0.97  |
| Q26   | 0.201 | 0.186 | 0.93  |
| Q27   | 0.771 | 0.729 | 0.95  |
| Q28   | 7.475 | 7.246 | 0.97  |
| Q29   | 0.254 | 0.236 | 0.93  |
| Q30   | 0.483 | 0.477 | 0.99  |
| Q31   | 0.508 | 0.549 | 1.08  |
| Q32   | 1.168 | 1.254 | 1.07  |
| Q33   | 1.431 | 1.493 | 1.04  |
| Q34   | 1.453 | 1.511 | 1.04  |
| Q35   | 0.359 | 0.369 | 1.03  |
| Q36   | 0.165 | 0.184 | 1.12  |
| Q37   | 0.161 | 0.166 | 1.03  |
| Q38   | 0.173 | 0.172 | 0.99  |
| Q39   | 0.177 | 0.182 | 1.03  |
| Q40   | 0.165 | 0.169 | 1.02  |
| Q41   | 0.162 | 0.172 | 1.06  |
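The change above amounts to releasing the scratch buffer as soon as a page has been decompressed, rather than caching it per column. A minimal sketch of the two lifetimes, using hypothetical names and a simple resizable buffer (not the actual Parquet reader code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical resizable scratch buffer that holds compressed bytes
// before decompression.
struct ResizableBuffer {
	std::vector<char> data;

	void Resize(size_t size) {
		if (size > data.size()) {
			data.resize(size);
		}
	}
	void Destroy() {
		data.clear();
		data.shrink_to_fit(); // actually release the memory
	}
};

// Old behavior: the buffer lives as long as the column reader, so its size
// stays at the high-water mark across pages. New behavior: destroy it after
// each use and re-allocate next time, trading a cheap allocation for lower
// resident memory.
static void ReadPage(ResizableBuffer &scratch, size_t compressed_size, bool release_after_use) {
	scratch.Resize(compressed_size);
	// ... read compressed bytes into scratch and decompress them here ...
	if (release_after_use) {
		scratch.Destroy();
	}
}
```

The follow-up idea mentioned above (a single cache shared across readers) would keep one high-water-mark buffer instead of one per column.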
…nemap (duckdb#16269)

During row group checks - we keep track of which filters are always true
(and hence do not need to be checked during the actual scan). Due to a
missing `else` statement, we could call this method twice for the same
filter.

Now this doesn't look problematic, but in `SetFilterAlwaysTrue` we
*also* keep track of how many filters have been set as being always true
- and if all filters are set as being always true, the filters are
skipped entirely.

This can cause problems when optional filters (as e.g. constructed due
to a Top-N) are combined with non-optional filters (as e.g. constructed
through a `WHERE` clause). The filter would not be correctly executed -
leading to incorrect rows sneaking into the result.
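The bookkeeping bug can be sketched as follows (a simplified, hypothetical version of the check, not DuckDB's actual code): marking the same filter always-true twice inflates the counter, so the scan can wrongly conclude that all filters are skippable.

```cpp
#include <cstddef>
#include <vector>

// Simplified sketch of per-row-group filter bookkeeping: once every filter is
// known to be always true, the scan skips filter execution entirely.
struct FilterState {
	size_t filter_count;
	size_t always_true_count = 0;
	std::vector<bool> always_true;

	explicit FilterState(size_t count) : filter_count(count), always_true(count, false) {
	}

	void SetFilterAlwaysTrue(size_t filter_idx) {
		// The missing `else` in the original code let this method run twice for
		// the same filter; guarding on the flag keeps the count correct.
		if (!always_true[filter_idx]) {
			always_true[filter_idx] = true;
			always_true_count++;
		}
	}
	bool CanSkipAllFilters() const {
		return always_true_count == filter_count;
	}
};
```

Without the guard, two calls for filter 0 would bring the count to 2 of 2 and silently drop the `WHERE` filter from the scan.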
* Check hints for equality and skip search.

fixes: duckdb#16250
fixes: duckdblabs/duckdb-internal#4229
Fix duckdb#16257

I've attached the diff that needs to happen to `main` when we eventually
need to merge this branch into it, as the Parquet writer has been
refactored significantly in the meantime.


[page_fix.txt](https://github.com/user-attachments/files/18827335/page_fix.txt)
Basically cherry-picking the commits of
duckdb#16154 to the `v1.2` branch, and
reducing the number of tests to a minimal set.
@Mytherin Mytherin merged commit 138639a into duckdb:main Feb 18, 2025
47 checks passed
Antonov548 added a commit to Antonov548/duckdb-r that referenced this pull request Mar 4, 2025
krlmlr pushed a commit to duckdb/duckdb-r that referenced this pull request Mar 5, 2025
@Mytherin Mytherin deleted the mergeintomain branch April 2, 2025 09:24
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 15, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 17, 2025
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025