Closed
Description
What happens?
When using DISTINCT ON within a query, we observed significant and unexpected memory usage. For example, we used DISTINCT ON four times against a small table (under 100k rows) and memory usage grew to over 8GB. We refactored the DISTINCT ON queries to use Polars unique, and the memory impact with Polars was negligible. One note: we are querying an Arrow table directly, rather than a native DuckDB table, using an in-memory DuckDB instance.
To Reproduce
select distinct on (id, provider) record_key
from arrow_table
order by id, provider, record_rank desc, record_date
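For reference, here is a minimal pure-Python sketch of the result we expect from the query above: sort by the ORDER BY columns, then keep the first row for each (id, provider) pair. The column names come from the query; the sample rows and the helper name `distinct_on_first` are hypothetical, and the sketch assumes `record_rank` is numeric and `record_date` is an ISO-formatted string. It only illustrates the intended semantics, not how DuckDB executes the query.

```python
# Hypothetical sample rows; column names follow the query, values are made up.
rows = [
    {"id": 1, "provider": "a", "record_rank": 1, "record_date": "2023-01-02", "record_key": "k1"},
    {"id": 1, "provider": "a", "record_rank": 2, "record_date": "2023-01-05", "record_key": "k2"},
    {"id": 1, "provider": "a", "record_rank": 2, "record_date": "2023-01-03", "record_key": "k3"},
    {"id": 1, "provider": "b", "record_rank": 1, "record_date": "2023-01-01", "record_key": "k4"},
    {"id": 2, "provider": "a", "record_rank": 5, "record_date": "2023-02-01", "record_key": "k5"},
]


def distinct_on_first(rows):
    """Return record_key of the first row per (id, provider) after ordering."""
    # ORDER BY id, provider, record_rank DESC, record_date
    # (negating record_rank assumes it is numeric).
    ordered = sorted(
        rows,
        key=lambda r: (r["id"], r["provider"], -r["record_rank"], r["record_date"]),
    )
    seen = set()
    result = []
    for r in ordered:
        key = (r["id"], r["provider"])
        if key not in seen:  # first row of each group wins
            seen.add(key)
            result.append(r["record_key"])
    return result


print(distinct_on_first(rows))
```

Apart from the sort itself, this approach only needs one entry per distinct (id, provider) pair, which is why the memory growth we saw on a table of this size was surprising.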
OS:
Amazon Linux 2023
DuckDB Version:
0.8.1
DuckDB Client:
Python (3.11.2)
Full Name:
Jeff White
Affiliation:
Obie Insurance
Have you tried this on the latest master branch?
I have not tested with any build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have