use random seeds for bernoulli sample when parallel is enabled #16223

Tmonster · 2025-02-13T12:22:52Z

fixes https://github.com/duckdblabs/duckdb-internal/issues/4203
fixes #16175

With a parallel Bernoulli sample over a small input, you could get skewed results, because the same seed was used for all threads. For example, with 10 threads and 1000 rows, every thread gets ~100 rows. In the example, the sample rate was 1%. It's possible that the random engine doesn't produce a value < 0.01 within its first 100 generated values. Since every thread draws the same sequence, that means none of the threads return any rows for the result.
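To make the failure mode concrete, here is a small Python simulation of the scenario above. It is only a sketch and not DuckDB code: `bernoulli_sample` and `parallel_sample_shared_seed` are hypothetical helpers, Python's `random.Random` stands in for the engine's random generator, and the seed value 42 is arbitrary.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def parallel_sample_shared_seed(rows, fraction, num_threads, seed):
    """Simulate threads that each scan one chunk but all start from the same seed."""
    chunk = len(rows) // num_threads
    chunks = [rows[i * chunk:(i + 1) * chunk] for i in range(num_threads)]
    return [bernoulli_sample(c, fraction, seed) for c in chunks]

rows = list(range(1000))
per_thread = parallel_sample_shared_seed(rows, 0.01, num_threads=10, seed=42)

# Every simulated thread replays the same random sequence, so each one keeps
# the same number of rows, at the same positions inside its chunk.
counts = [len(sample) for sample in per_thread]
offsets = [[row % 100 for row in sample] for sample in per_thread]

# Chance that a single sequence of 100 draws never falls below 0.01.
p_empty = 0.99 ** 100
```

Because all chunks share one sequence, the total row count is always a multiple of the thread count, and whenever that one sequence has no draw below 0.01 in its first 100 values (probability `0.99 ** 100`, roughly 37%), the whole sample comes back empty.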

The fix is to assign every thread its own random seed. If repeatable is set, we instead set ParallelSink to false and use the provided seed, which guarantees a repeatable result.
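The decision described above can be sketched roughly as follows. This is a Python illustration under assumptions, not the actual C++ change: `plan_bernoulli_sample` and `run_sample` are hypothetical names, and drawing per-thread seeds from `random.SystemRandom` is just one way to get independent seeds.

```python
import random

def plan_bernoulli_sample(num_threads, repeatable_seed=None):
    """Return (parallel, seeds). Hypothetical helper, not a DuckDB API."""
    if repeatable_seed is not None:
        # Repeatable: a single sequential stream driven by the user's seed.
        return False, [repeatable_seed]
    # Not repeatable: draw an independent seed for each thread.
    entropy = random.SystemRandom()
    return True, [entropy.getrandbits(64) for _ in range(num_threads)]

def run_sample(rows, fraction, num_threads, repeatable_seed=None):
    """Simulate the sample, sequentially if a seed is given, chunked otherwise."""
    parallel, seeds = plan_bernoulli_sample(num_threads, repeatable_seed)
    if not parallel:
        rng = random.Random(seeds[0])
        return [r for r in rows if rng.random() < fraction]
    chunk = -(-len(rows) // num_threads)  # ceiling division
    out = []
    for i, seed in enumerate(seeds):
        rng = random.Random(seed)  # each simulated thread has its own stream
        out.extend(r for r in rows[i * chunk:(i + 1) * chunk]
                   if rng.random() < fraction)
    return out
```

With a seed, two calls such as `run_sample(rows, 0.01, 10, repeatable_seed=42)` return the same rows. Without one, each chunk is sampled from a different stream, so one unlucky sequence no longer empties the whole result.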

This is similar to how reservoir sampling currently behaves.
Related PR: https://github.com/duckdb/duckdb/pull/14797/files

Mytherin · 2025-02-17T13:16:36Z

Thanks!

use random seeds for bernoulli sample when parallel is enabled

9ca9ca9

szarnyasg mentioned this pull request Feb 13, 2025

Bernoulli sample gives strange results in CTE with DISTINCT #16175

Closed

2 tasks

format-fix

cd514e9

duckdb-draftbot marked this pull request as draft February 13, 2025 19:03

Mytherin marked this pull request as ready for review February 13, 2025 19:49

whenever seed is set, parallel sink is false

3c90da4

duckdb-draftbot marked this pull request as draft February 17, 2025 09:45

Tmonster marked this pull request as ready for review February 17, 2025 09:45

Mytherin merged commit 52811a9 into duckdb:v1.2-histrionicus Feb 17, 2025
50 checks passed

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Mar 7, 2025

vendor: Update vendored sources to duckdb/duckdb@52811a9

4215f20

use random seeds for bernoulli sample when parallel is enabled (duckdb/duckdb#16223)
