Skip to content

Bernoulli sample gives strange results in CTE with DISTINCT #16175

@ADBond

Description

@ADBond

What happens?

I am getting strange results in duckdb 1.2.0 when I Bernoulli sample from a table in a CTE from a table with SELECT DISTINCT. I am finding I get a table with no rows far more often than I would expect (and occasionally a table with far more rows than I would expect also).

For instance, sampling from a table with 1000 rows with a Bernoulli probability of 1% I expect an empty table about 1 in ~23000 times. But it is happening nearly 1 in 3 times.

There seems to be some interaction with using DISTINCT - if I sample from the table with a SELECT DISTINCT clause I get the strange results (even if my table already has distinct values), but if I sample directly from my source table (which is already distinct) I get the results I expect.

This does not occur in duckdb 1.1.3, where I get the results I expect.

To Reproduce

With this python script:

from statistics import mean, stdev

import duckdb

con = duckdb.connect()
sql = """
WITH some_tab AS (
    SELECT UNNEST(range(1000)) AS id
),
some_tab_unq AS (
    SELECT distinct(id) AS id FROM some_tab
),
sampled AS (
    select id from some_tab_unq
    USING SAMPLE 1% (bernoulli)
)
select count(*) as n_rows FROM sampled
"""

row_counts = [con.sql(sql).fetchone()[0] for _ in range(500)]
print({
    "min_rows": min(row_counts),
    "max_rows": max(row_counts),
    "count_empty": sum([r == 0 for r in row_counts]),
    "mean": mean(row_counts),
    "std": stdev(row_counts)
})

A typical output is

{'min_rows': 0, 'max_rows': 50, 'count_empty': 146, 'mean': 10.06, 'std': 11.635599440883885}

This gives far more empty tables than expected, and the standard deviation is far off (should be ~3.1).

In duckdb 1.1.3 this gives (something like)

{'min_rows': 2, 'max_rows': 22, 'count_empty': 0, 'mean': 10.366, 'std': 3.1502209102802827}

If the query is instead

WITH some_tab AS (
    SELECT UNNEST(range(1000)) AS id
),
sampled AS (
    select id from some_tab
    USING SAMPLE 1% (bernoulli)
)
select count(*) as n_rows FROM sampled

(which I would expect to give identical results) I get the expected set of values.

I have tried these queries directly in the CLI and find the same issue.

OS:

MacOS 15.3, x86_64

DuckDB Version:

1.2.0

DuckDB Client:

Python

Hardware:

No response

Full Name:

Andrew Bond

Affiliation:

Ministry of Justice

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions