Description
What happens?
On a mac-ARM machine, when building a query that reads multiple Parquet files specified as an absolute path in the OS file system ending in *.parquet (e.g. /path/to/the/files/*.parquet), executing the query under a DuckDB process that has limited its own working memory through SET memory_limit ends up segfaulting when it needs to spill data to disk.
The bug does not occur if the path is specified as a relative path from the working directory (i.e. just *.parquet).
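To make the path distinction concrete, here is a minimal sketch of the two invocations, where q stands for the full query from the reproduction below and the absolute directory is a placeholder for wherever the files were generated:

import duckdb

duckdb.query("SET memory_limit='1GB';")

# Absolute glob: segfaults once the query needs to spill to disk
res = duckdb.query(q.replace("'*.parquet'", "'/path/to/the/files/*.parquet'")).df()

# Relative glob, run from the directory holding the files: completes fine
res = duckdb.query(q).df()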
The same query executes successfully on linux-amd64 without crashing, and removing the memory limit avoids the crash on mac-ARM. I'm not sure whether it's related to ARM, to macOS, or to something else, but it does segfault whether executed from Python or from R.
In case it's related: when compiling DuckDB from source on the machine where it crashes, I see many compilation warnings about jemalloc.
And in case it helps: when executed in R, the crash is always blamed on address 0x4 (with "invalid permissions"), which doesn't look like an address that a memory allocation would return.
To Reproduce
Code to generate the files for the query:
import numpy as np, pandas as pd
import string

rng = np.random.default_rng(seed=123)

def gen_large_parquet(file_number):
    n = int(5e6)
    pct_missing = 0.05

    # Timestamps spread over 2020, with ~5% set to NaT
    def gen_ts_col():
        low = np.datetime64("2020-01-01 00:00:00")
        high = np.datetime64("2020-12-31 23:59:59")
        out = rng.integers(
            low.astype(int),
            high.astype(int),
            size=n
        ).astype("datetime64[s]")
        set_missing = rng.uniform(size=n) <= pct_missing
        out[set_missing] = np.datetime64("NaT")
        return out

    # Standard-normal numbers, with ~5% set to NaN
    def gen_num_col():
        out = rng.standard_normal(size=n)
        set_missing = rng.uniform(size=n) <= pct_missing
        out[set_missing] = np.nan
        return out

    # A random alphanumeric word of length 1-19
    def gen_random_word():
        str_length = rng.integers(low=1, high=20)
        choices = [c for c in string.ascii_uppercase + string.digits]
        return "".join(rng.choice(choices, size=str_length))

    # A categorical column with 3-19 distinct levels, with ~5% set to None
    def gen_cat_col():
        ncateg = rng.integers(low=3, high=20)
        choices = [gen_random_word() for cat in range(ncateg)]
        out = rng.choice(choices, size=n).astype(object)
        set_missing = rng.uniform(size=n) <= pct_missing
        out[set_missing] = None
        return out

    df = pd.DataFrame({
        "dt"   : gen_ts_col(),
        "cat1" : gen_cat_col(),
        "cat2" : gen_cat_col(),
        "cat3" : gen_cat_col(),
        "num1" : gen_num_col(),
        "num2" : gen_num_col(),
        "num3" : gen_num_col(),
    })
    n_additional = 10
    for i in range(n_additional):
        name_num = "additional_num_" + str(i)
        name_cat = "additional_cat_" + str(i)
        name_dt = "additional_dt_" + str(i)
        df[name_num] = gen_num_col()
        df[name_cat] = gen_cat_col()
        df[name_dt] = gen_ts_col()
    df.to_parquet(f"df{file_number}.parquet")

for file_number in range(3):
    gen_large_parquet(file_number)
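As a quick sanity check (not part of the original report), the generated data can be inspected with DuckDB itself; with the parameters above, the glob should cover three files of 5 million rows each:

import duckdb

# Expect 15_000_000 rows across df0.parquet, df1.parquet and df2.parquet
print(duckdb.query("select count(*) as n from read_parquet('*.parquet')").df())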
The query:
import duckdb

q = """
with tbl0 as (
    select
        *
    from read_parquet(
        '*.parquet' -- per the description above, substitute the absolute path to these files to trigger the crash
        , filename=true
        , file_row_number=true
        , union_by_name=true
    )
    where
        dt is not null
        and dt >= '2020-04-01'
        and dt <= '2020-08-01'
        and cat1 is not null
        and num1 is not null
)
, tbl_join1 as (
    select
        cat1
        , favg("num2") as "num2_avg"
        , count(*) as num2_cnt
    from tbl0
    where
        "num2" is not null
        and not isnan("num2"::float)
    group by
        cat1
)
, tbl_join2 as (
    select
        cat3
        , favg("num3") as "num3_avg"
        , count(*) as num3_cnt
    from tbl0
    where
        "num3" is not null
        and not isnan("num3"::float)
    group by
        cat3
)
select
    tbl0.cat1
    , tbl0.cat3
    , cat2
    , num1
    , num3
    , num2
    , dt
    , num2_avg
    , num2_cnt
    , num3_avg
    , num3_cnt
    , row_number() over(order by filename, file_row_number) as row_num
from tbl0
join tbl_join1
    on tbl_join1.cat1 = tbl0.cat1
join tbl_join2
    on tbl_join2.cat3 = tbl0.cat3
where
    "dt" is not null
order by filename, file_row_number
""".strip()

duckdb.query("SET memory_limit='1GB';")
res = duckdb.query(q).df()
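In case it's useful when reproducing, the effective limit on the default connection can be verified before running the query; current_setting is a built-in DuckDB function, though the exact formatting of the returned value may vary by version:

import duckdb

duckdb.query("SET memory_limit='1GB';")
# Should show the roughly 1GB limit that was just set
print(duckdb.query("select current_setting('memory_limit') as memory_limit").df())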
OS:
macOS-aarch64
DuckDB Version:
0.7.2-dev1146
DuckDB Client:
Python and R
Full Name:
David Cortes
Affiliation:
None
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree