Skip to content

Conversation

Tmonster
Copy link
Contributor

@Tmonster Tmonster commented Feb 13, 2025

fixes https://github.com/duckdblabs/duckdb-internal/issues/4203
fixes #16175

For smaller data sizes and a parallel Bernoulli sample, you could get skewed results. This is because the same seed was used for all threads. So if you have 10 threads and a data size of 1000, then every thread gets ~100 rows. In the example, the sample size was 1%. It's possible the random engine doesn't produce a value <0.01 for the first 100 randomly generated values. This means none of the threads return a value for the result.

The fix is to assign every thread a random seed, but if repeatable is set, then we set ParallelSink to false and we use the seed, guaranteeing a repeatable result.

This PR is similar to how reservoir sampling currently behaves.
related PR
https://github.com/duckdb/duckdb/pull/14797/files

@duckdb-draftbot duckdb-draftbot marked this pull request as draft February 13, 2025 19:03
@Mytherin Mytherin marked this pull request as ready for review February 13, 2025 19:49
@duckdb-draftbot duckdb-draftbot marked this pull request as draft February 17, 2025 09:45
@Tmonster Tmonster marked this pull request as ready for review February 17, 2025 09:45
@Mytherin Mytherin merged commit 52811a9 into duckdb:v1.2-histrionicus Feb 17, 2025
50 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Mar 7, 2025
use random seeds for bernoulli sample when parallel is enabled (duckdb/duckdb#16223)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants