
Performance regression: DISTINCT ON got slower and more memory hungry #10675

@soerenwolfers

Description

What happens?

DISTINCT ON has undergone a significant performance regression at some point between 0.9.2 and 0.9.3.dev3715 (unfortunately, I sit behind a corporate proxy and cannot test anything in between). It takes up to 3x longer when it succeeds, and it uses up to 10x the VIRT memory and 4x the RES memory, making it run out of memory more often.
(Also, surprisingly, the memory consumption is additive across queries, which to the uninformed observer might suggest a memory leak.)

I previously hinted at this in #10224, but since the main issue there has been resolved, I figured it was time to create a separate issue for this.

To Reproduce

First, produce an MWE DataFrame:

import duckdb
import pandas as pd
import numpy as np

# ~35k timestamps at 30-minute intervals over two years
times = pd.date_range('2019-01-01 00:00:00', '2020-12-31 00:00:00', freq='30 min')
n_times = len(times)
n_products = 3000
n = n_times * n_products  # ~105M rows in total

# One row per (time, product) pair
df = pd.DataFrame({
    'time': np.repeat(times, n_products),
    'value': np.arange(n),
    'product': np.tile(np.arange(n_products), n_times)
})
# Localize to UTC; see the notes below for why this is needed on 0.9.2
df['time'] = df['time'].dt.tz_localize('UTC').astype('datetime64[us, UTC]')

# Cap memory usage and confirm the setting took effect
duckdb.query("""
SET MEMORY_LIMIT='64GB';
SELECT * FROM duckdb_settings()
WHERE name SIMILAR TO '.*mem.*'
""")

Then, using 0.9.3.dev3715 or 0.10.0, the following query

duckdb.query("""
SELECT DISTINCT ON (time::DATE, product) *
FROM df
ORDER BY time DESC
""")

takes one minute on the first run, peaking at 190GB VIRT and 50GB RES memory usage as reported by htop. On rerunning the query, it almost immediately fails with

OutOfMemoryException: Out of Memory Error: Failed to allocate block of 262144 bytes

On 0.9.2, the same query completes in 25 seconds, peaking at 19GB VIRT and 16GB RES memory usage as reported by htop. Also, the query can be rerun arbitrarily often without VIRT or RES increasing, and it finishes in ~18s on reruns.

If I reduce n_products to 2000, 0.10.0 takes 41s, peaking at 100GB VIRT / 30GB RES, whereas 0.9.2 finishes in 10-14s, peaking at 10GB VIRT / 8GB RES.
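
To make the additive memory growth easier to observe, a minimal measurement sketch (assuming psutil is installed; this was not part of my original testing) could rerun the query and print the process RSS after each run:

import duckdb
import psutil

proc = psutil.Process()
for i in range(3):
    # count(*) wrapper avoids materializing ~105M result rows in Python
    duckdb.query("""
    SELECT count(*) FROM (
        SELECT DISTINCT ON (time::DATE, product) *
        FROM df
        ORDER BY time DESC
    )
    """).fetchall()
    print(f"run {i}: RSS {proc.memory_info().rss / 1e9:.1f} GB")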

NOTES:

  • The memory limit of 64GB doesn't seem to do anything; I added it only as a safety measure because, while playing around with the parameters of this MWE, I quite frequently froze my machine to the point where it would no longer even accept ssh connections (the freeze happens during the duckdb query). That is probably for another ticket, though, and I get the same problems with a 128GB limit.
  • The UTC localization is only needed because the query otherwise fails on the date cast in 0.9.2. It doesn't affect memory usage or runtime on later versions.
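
For what it's worth, a window-function rewrite should produce the same rows (modulo tie-breaking and output order); this sketch (untested on my side) might help narrow down whether the regression is specific to the DISTINCT ON code path:

# Keep the latest row per (date, product) via row_number() instead of DISTINCT ON
duckdb.query("""
SELECT *
FROM df
QUALIFY row_number() OVER (
    PARTITION BY time::DATE, product
    ORDER BY time DESC
) = 1
""")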

OS:

Linux

DuckDB Version:

0.10.0

DuckDB Client:

Python 3.10.6
pandas 2.1.4
numpy 1.26.3

Full Name:

Soeren Wolfers

Affiliation:

G-Research

Have you tried this on the latest nightly build?

I have tested with a nightly build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
