Skip to content

Conversation

Tmonster
Copy link
Contributor

Addresses https://github.com/duckdblabs/duckdb-internal/issues/4110

Sampling has bad memory management right now, which is causing larger ingestions to fail. In the interest of releasing early next week, it's probably best that it is removed and later added when memory issues are resolved

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 31, 2025 08:16
@Tmonster Tmonster marked this pull request as ready for review January 31, 2025 08:16
@gregorywaynepower
Copy link

Does this mean that y'all are removing sampling altogether?

@Mytherin
Copy link
Collaborator

Mytherin commented Feb 1, 2025

Does this mean that y'all are removing sampling altogether?

No, this is unrelated to the sample clause - this is about gathering samples as part of the table statistics to guide optimization. While the gathering part was implemented in this version the optimizer is not using it yet so this should not have any user-facing effect.

@Mytherin Mytherin merged commit 2451285 into duckdb:v1.2-histrionicus Feb 1, 2025
50 checks passed
Mytherin added a commit that referenced this pull request Feb 4, 2025
Following up on #16010 I was
wondering if too much memory was used during regular sampling as well.
Turns out there is and it is because we over sample.

If we have a reservoir sample of 5 rows (either by sampling 0.00005% or
5 rows) then the capacity of the reservoir is calculated to be the
following

```
= sample-count + (FIXED_SAMPLE_SIZE_MULTIPLIER * FIXED_SAMPLE_SIZE)
= 5 + 10 * 2048
```

The 10*2048 is quite large compared to the 5 rows. For percentage
samples the extra allocations can add up since one sample is created per
100000 rows. The updated formula is
```
= sample-count + (FIXED_SAMPLE_SIZE_MULTIPLIER * MinValue<idx_t>(FIXED_SAMPLE_SIZE, sample_count))
= 5 + 10 * 5
```

which reduces the extra space by 400x, which will also reduce the memory
pressure during sampling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants