Distinct On Memory Issues #8505

@jeffdwhite

What happens?

When using DISTINCT ON in a query, we observed significant and unexpected memory usage. For example, we used DISTINCT ON four times on a small table (under 100k rows) and memory usage grew to over 8 GB. After we replaced DISTINCT ON with Polars unique, the memory impact was negligible. One note: we are querying an Arrow table directly from an in-memory DuckDB instance, rather than querying a native DuckDB table.

To Reproduce

select distinct
  on (id, provider) record_key
from
  arrow_table
order by
  id,
  provider,
  record_rank desc,
  record_date
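
For readers unfamiliar with DISTINCT ON, the following is a minimal pure-Python sketch of the result the query above is expected to produce: one record_key per (id, provider) pair, taken from the first row after sorting by record_rank descending and then record_date ascending. The sample rows are hypothetical and only illustrate the semantics. This is not DuckDB's implementation and does not reproduce the memory behaviour.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows; column names follow the query above.
rows = [
    {"id": 1, "provider": "a", "record_rank": 1, "record_date": "2023-01-02", "record_key": "k1"},
    {"id": 1, "provider": "a", "record_rank": 2, "record_date": "2023-01-03", "record_key": "k2"},
    {"id": 1, "provider": "a", "record_rank": 2, "record_date": "2023-01-01", "record_key": "k3"},
    {"id": 2, "provider": "b", "record_rank": 1, "record_date": "2023-01-05", "record_key": "k4"},
]


def distinct_on_first(rows):
    """Return record_key of the first row per (id, provider) group."""
    # ORDER BY id, provider, record_rank DESC, record_date
    ordered = sorted(
        rows,
        key=lambda r: (r["id"], r["provider"], -r["record_rank"], r["record_date"]),
    )
    # DISTINCT ON (id, provider): keep only the first row of each group.
    return [
        next(group)["record_key"]
        for _, group in groupby(ordered, key=itemgetter("id", "provider"))
    ]


result = distinct_on_first(rows)
print(result)
```

Since only one row per group survives, the output should be far smaller than the input, which is why the observed memory growth was unexpected.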

OS:

Amazon Linux 2023

DuckDB Version:

0.8.1

DuckDB Client:

Python (3.11.2)

Full Name:

Jeff White

Affiliation:

Obie Insurance

Have you tried this on the latest master branch?

I have not tested with any build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
