Performance regression in 1.3.0 reading Geoparquet #17855

@DFEvans

What happens?

Between 1.2.2 and 1.3.0, performance (runtime, CPU use, and memory use) has become much worse when running some queries on an S3-hosted GeoParquet dataset, specifically the Overture buildings dataset.

Running the example script below, with memray for memory reporting, I see the following:

  • 1.2.2: 48s, peak memory use 210MiB
  • 1.3.0: 328s, peak memory use 5.5GiB
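For anyone who wants to collect comparable numbers without memray, here is a minimal sketch using only the Python standard library. It is not the method used for the figures above: it swaps memray for `resource.getrusage`, which reports the process-wide peak resident set size. The helper name `measure` is hypothetical, and the KiB unit is an assumption that holds on Linux (macOS reports bytes).

```python
import resource
import time


def measure(fn):
    """Run fn() and return (result, elapsed_seconds, peak_rss_mib).

    Assumes Linux, where ru_maxrss is reported in KiB.
    """
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    # ru_maxrss is the high-water mark for the whole process so far,
    # not just for this call.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_kib / 1024


if __name__ == "__main__":
    _, seconds, peak_mib = measure(lambda: sum(range(1_000_000)))
    print(f"{seconds:.3f}s, peak RSS {peak_mib:.1f} MiB")
```

Because the peak is process-wide, each DuckDB version would need to be measured in a fresh Python process for the numbers to be comparable.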

I have observed this behaviour both locally, running Ubuntu under WSL2, and on a cloud-hosted Docker image also running Ubuntu. The severity appears to depend on the xmin/xmax/ymin/ymax values: the example below is the worst I've seen, causing OOM errors and timeouts on the cloud-hosted version, while other bounding boxes show runtimes and memory footprints similar to those under 1.2.2.

The actual results of the query have not changed.
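Since the severity depends on the bounding box, it may help to time several boxes in one run. The following is a rough sketch under stated assumptions, not part of the original report: `time_bboxes`, `run_query`, and the box names are hypothetical, and the query is a trimmed version of the filter used in the reproduction script (it selects only `id`).

```python
import time

# Trimmed version of the bbox filter from the reproduction script.
QUERY_TEMPLATE = """SELECT id
FROM read_parquet('{url}', filename=true, hive_partitioning=1)
WHERE bbox.xmax >= {xmin}
  AND bbox.xmin <= {xmax}
  AND bbox.ymax >= {ymin}
  AND bbox.ymin <= {ymax}
"""


def time_bboxes(run_query, url, bboxes):
    """Time one query per bounding box.

    run_query: callable taking a SQL string and executing it.
    bboxes: dict mapping a label to (xmin, ymin, xmax, ymax).
    Returns a dict mapping each label to elapsed seconds.
    """
    timings = {}
    for name, (xmin, ymin, xmax, ymax) in bboxes.items():
        query = QUERY_TEMPLATE.format(
            url=url, xmin=xmin, ymin=ymin, xmax=xmax, ymax=ymax
        )
        start = time.perf_counter()
        run_query(query)
        timings[name] = time.perf_counter() - start
    return timings
```

With a DuckDB connection `conn`, `run_query` could be something like `lambda q: conn.sql(q).fetchall()`, which forces the query to execute fully.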

To Reproduce

import duckdb
import time

# Overture Maps buildings release, hosted on S3 as GeoParquet
DATASET_URL = r"s3://overturemaps-us-west-2/release/2025-03-19.0/theme=buildings/type=building/*"

# Select buildings whose bbox intersects the query rectangle
OVERTURE_QUERY_TEMPLATE = """SELECT
    id,
    ST_AsText(geometry) as geometry
FROM
    read_parquet('{url}', filename=true, hive_partitioning=1)
WHERE
    bbox.xmax >= {xmin}
        and bbox.xmin <= {xmax}
        and bbox.ymax >= {ymin}
        and bbox.ymin <= {ymax}
"""

conn = duckdb.connect(":memory:")
# The spatial extension provides ST_AsText
conn.install_extension("spatial")
conn.load_extension("spatial")

# Worst-case bounding box observed so far
query = OVERTURE_QUERY_TEMPLATE.format(
    url=DATASET_URL,
    xmin=113.8559225832035,
    ymin=-1.8211608973077427,
    xmax=113.89983634855709,
    ymax=-1.7858043460487265,
)

# Time the full query, including writing the results to CSV
t1 = time.perf_counter()
conn.sql(query).to_csv("dump.csv")
t2 = time.perf_counter()
print(t2 - t1)

OS:

Ubuntu 22.04.5 LTS, x86_64, running under WSL2 locally

DuckDB Version:

1.3.0

DuckDB Client:

Python

Hardware:

AMD Ryzen 9 5900HS with 32 GB RAM locally; cloud instance is an AWS Lambda set to 4 GiB RAM.

Full Name:

Daniel Evans

Affiliation:

SatVu

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
