use random seeds for bernoulli sample when parallel is enabled #16223
fixes https://github.com/duckdblabs/duckdb-internal/issues/4203
fixes #16175
With a parallel Bernoulli sample over a small data set, the results could be skewed because every thread used the same seed. For example, with 10 threads and 1000 rows, each thread gets ~100 rows. With a sample size of 1%, it is possible that the random engine does not produce a value <0.01 in its first 100 generated values. Since every thread draws the same sequence, none of the threads then returns a row and the result is empty.
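The effect can be reproduced outside DuckDB with a small simulation. This is only a sketch of the scenario described above, not the DuckDB code: `bernoulli_sample_chunk` and `parallel_sample` are hypothetical helpers, the "threads" run sequentially, and Python's `random.Random` stands in for the engine's random generator.

```python
import random

def bernoulli_sample_chunk(chunk, fraction, seed):
    # Hypothetical helper: keep a row when its random draw is below `fraction`.
    rng = random.Random(seed)
    return [row for row in chunk if rng.random() < fraction]

def parallel_sample(rows, n_threads, fraction, seeds):
    # Simulate n_threads workers, each sampling an equal slice with its own seed.
    chunk_size = len(rows) // n_threads
    result = []
    for t in range(n_threads):
        chunk = rows[t * chunk_size:(t + 1) * chunk_size]
        result.extend(bernoulli_sample_chunk(chunk, fraction, seeds[t]))
    return result

rows = list(range(1000))

# Same seed on every thread: each thread keeps the same offsets in its chunk,
# so the result size is always a multiple of 10 (and is often 0).
shared = parallel_sample(rows, 10, 0.01, seeds=[42] * 10)

# A different seed per thread: the draws are independent across chunks.
distinct = parallel_sample(rows, 10, 0.01, seeds=list(range(10)))

# Chance of an empty result at a 1% sample rate.
p_empty_shared = 0.99 ** 100     # one shared stream of 100 draws, about 0.366
p_empty_distinct = 0.99 ** 1000  # 1000 independent draws, about 4.3e-5

print(len(shared), len(distinct), p_empty_shared, p_empty_distinct)
```

With a shared seed, the 1000 rows effectively receive only 100 distinct random draws, which is why an empty result is so likely at this data size.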
The fix is to assign every thread a random seed. If repeatable is set, we instead set `ParallelSink` to false and use the given seed, which guarantees a repeatable result. This is similar to how reservoir sampling currently behaves.
Related PR: https://github.com/duckdb/duckdb/pull/14797/files
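The seeding rule described in the fix can be sketched as follows. This is an illustration under assumptions, not the actual C++ change: `plan_sample` and `run_sample` are hypothetical names, the `parallel` flag only loosely mirrors the role of `ParallelSink`, and `secrets.randbits` stands in for whatever source of random seeds the engine uses.

```python
import random
import secrets

def plan_sample(n_threads, repeatable_seed=None):
    # Hypothetical planner: a repeatable seed forces a single, serial stream.
    if repeatable_seed is not None:
        return {"parallel": False, "seeds": [repeatable_seed]}
    # Otherwise every worker gets its own freshly drawn seed.
    return {"parallel": True,
            "seeds": [secrets.randbits(64) for _ in range(n_threads)]}

def run_sample(rows, fraction, plan):
    # One chunk per seed; with a single seed the whole input is one chunk.
    seeds = plan["seeds"]
    chunk_size = -(-len(rows) // len(seeds))  # ceiling division
    out = []
    for i, seed in enumerate(seeds):
        rng = random.Random(seed)
        chunk = rows[i * chunk_size:(i + 1) * chunk_size]
        out.extend(r for r in chunk if rng.random() < fraction)
    return out

rows = list(range(1000))

# Repeatable: same seed, serial execution, identical result on every run.
first = run_sample(rows, 0.1, plan_sample(10, repeatable_seed=7))
second = run_sample(rows, 0.1, plan_sample(10, repeatable_seed=7))

# Not repeatable: ten independent seeds, result differs between runs.
unseeded_plan = plan_sample(10)
unseeded = run_sample(rows, 0.1, unseeded_plan)
```

Running the repeatable case serially trades parallelism for determinism: the result no longer depends on how rows happen to be distributed across threads.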