Description
What happens?
Between 1.2.2 and 1.3.0, I'm seeing much worse performance (runtime, CPU use, and memory use) when running some queries on an S3-hosted GeoParquet dataset (specifically, the Overture buildings dataset).
Running the example script below and using memray for memory reporting, I see the following:
- 1.2.2: 48s, peak memory use 210MiB
- 1.3.0: 328s, peak memory use 5.5GiB
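For anyone who wants a rough cross-check of these figures without memray, the measurement can be sketched with the standard library alone. This is not the method used for the numbers above: the helper name `measure` is hypothetical, and it reports process-wide peak resident set size rather than memray's peak allocation figure, so the two will not match exactly. It assumes a Unix system such as the Ubuntu setups described here.

```python
import resource
import time


def measure(fn):
    """Run fn() once and return (elapsed_seconds, peak_rss_mib).

    Hypothetical helper, not part of DuckDB or memray.
    """
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    # ru_maxrss is the peak resident set size of the whole process so far,
    # including native allocations. Linux reports it in KiB (macOS uses bytes).
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return elapsed, peak_kib / 1024
```

Wrapping the query call from the reproduction script, for example `measure(lambda: conn.sql(query).to_csv("dump.csv"))`, in a fresh Python process per DuckDB version keeps the peak figure from being inflated by earlier work in the same process.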
I have observed this behaviour both locally, running under Ubuntu on WSL2, and on a cloud-hosted Docker image also running Ubuntu. The severity appears to depend on the xmin/xmax/ymin/ymax values: the example below is the worst I've seen, causing OOM/timeouts on the cloud-hosted version, while other bounding boxes have runtimes and memory footprints similar to before.
The actual results of the query have not changed.
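One way to check that claim independently is to fingerprint the CSV dump produced by each version while ignoring row order, since the order of rows is not guaranteed without an ORDER BY. The sketch below is only an illustration: the helper name `csv_fingerprint` and the file names in the usage note are assumptions, not something from the original report.

```python
import csv
import hashlib


def csv_fingerprint(path):
    """Return (row_count, sha256_hex) for a CSV file, ignoring row order.

    Hypothetical helper for comparing dumps from two DuckDB versions.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)  # None for an empty file
        rows = sorted(tuple(row) for row in reader)
    digest = hashlib.sha256()
    digest.update(repr(header).encode())
    for row in rows:
        digest.update(repr(row).encode())
    return len(rows), digest.hexdigest()
```

If the script is run once per version with the output renamed (say, to the hypothetical `dump_1.2.2.csv` and `dump_1.3.0.csv`), equal fingerprints indicate that both versions returned the same set of rows.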
To Reproduce
import duckdb
import time

DATASET_URL = r"s3://overturemaps-us-west-2/release/2025-03-19.0/theme=buildings/type=building/*"

OVERTURE_QUERY_TEMPLATE = """SELECT
    id,
    ST_AsText(geometry) as geometry
FROM
    read_parquet('{url}', filename=true, hive_partitioning=1)
WHERE
    bbox.xmax >= {xmin}
    and bbox.xmin <= {xmax}
    and bbox.ymax >= {ymin}
    and bbox.ymin <= {ymax}
"""

conn = duckdb.connect(":memory:")
conn.install_extension("spatial")
conn.load_extension("spatial")

query = OVERTURE_QUERY_TEMPLATE.format(
    url=DATASET_URL,
    xmin=113.8559225832035,
    ymin=-1.8211608973077427,
    xmax=113.89983634855709,
    ymax=-1.7858043460487265,
)

t1 = time.time()
conn.sql(query).to_csv("dump.csv")
t2 = time.time()
print(t2-t1)
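Because the severity depends on the bounding box, it may help to generate the same query for several boxes and time each one. The following is a minimal sketch of that parameterization only; `BBox` and `build_query` are hypothetical names, and the sketch builds the SQL text without running it.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical container for the four bounding-box values."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


# Same predicate as the reproduction script, condensed onto fewer lines.
QUERY_TEMPLATE = """SELECT id, ST_AsText(geometry) AS geometry
FROM read_parquet('{url}', filename=true, hive_partitioning=1)
WHERE bbox.xmax >= {xmin} AND bbox.xmin <= {xmax}
  AND bbox.ymax >= {ymin} AND bbox.ymin <= {ymax}
"""


def build_query(url, box):
    """Return the bbox-intersection query text for one bounding box."""
    if box.xmin > box.xmax or box.ymin > box.ymax:
        raise ValueError(f"inverted bounding box: {box}")
    return QUERY_TEMPLATE.format(url=url, **box._asdict())
```

Each generated query can then be passed to `conn.sql(...)` exactly as in the script above, with one timing recorded per box.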
OS:
Ubuntu 22.04.5 LTS, x86_64, running under WSL2 locally
DuckDB Version:
1.3.0
DuckDB Client:
Python
Hardware:
AMD Ryzen 9 5900HS with 32 GB RAM locally; cloud instance is an AWS Lambda set to 4 GiB RAM.
Full Name:
Daniel Evans
Affiliation:
SatVu
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have