
Segfault when a query specifying an absolute path with globs for inputs is executed with limited memory #6881

@david-cortes

Description


What happens?

On a mac-ARM machine, when building a query that reads multiple parquet files specified as an absolute path in the OS file system but ending in *.parquet (e.g. /path/to/the/files/*.parquet), executing that query in a DuckDB process that has limited its own working memory through SET memory_limit ends up segfaulting when it needs to spill data to disk.

The bug does not occur if the path is instead specified as a relative path from the working directory (i.e. just *.parquet).

The same query executes successfully on linux-amd64 without crashing, and removing the memory limit avoids the crash on mac-ARM. I'm not sure whether it's related to ARM, to macOS, or to something else, but it does end up segfaulting, whether it's executed from Python or from R.
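A minimal sketch of the two variants (the directory below is a placeholder for wherever the files actually live; on the affected machine the first form crashes and the second doesn't):

import duckdb

duckdb.query("SET memory_limit='1GB';")

# Absolute path plus glob: segfaults on mac-ARM once the query spills to disk
# ('/path/to/the/files' stands in for the real directory):
duckdb.query("select * from read_parquet('/path/to/the/files/*.parquet')").df()

# The same call with a relative glob, run from that directory, completes fine:
duckdb.query("select * from read_parquet('*.parquet')").df()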

In case it's related, when compiling DuckDB from source on the machine where it crashes, I see many compilation warnings about jemalloc.

And in case it helps: when executed in R, the crash is always blamed on address 0x4 (with "invalid permissions"), which doesn't look like something a memory allocation would return.

To Reproduce

Code to generate files for a query:

import numpy as np, pandas as pd
import string

rng = np.random.default_rng(seed=123)

def gen_large_parquet(file_number):
    n = int(5e6)
    pct_missing = 0.05
    
    def gen_ts_col():
        low = np.datetime64("2020-01-01 00:00:00")
        high = np.datetime64("2020-12-31 23:59:59")
        out = rng.integers(
            low.astype(int),
            high.astype(int),
            size=n
        ).astype("datetime64[s]")
        set_missing = rng.uniform(size=n) <= pct_missing
        out[set_missing] = np.datetime64("NaT")
        return out
    
    def gen_num_col():
        out = rng.standard_normal(size=n)
        set_missing = rng.uniform(size=n) <= pct_missing
        out[set_missing] = np.nan
        return out

    def gen_random_word():
        # random word of 1-19 uppercase letters/digits
        str_length = rng.integers(low=1, high=20)
        choices = [c for c in string.ascii_uppercase + string.digits]
        return "".join(rng.choice(choices, size=str_length))

    def gen_cat_col():
        ncateg = rng.integers(low=3, high=20)
        choices = [gen_random_word() for cat in range(ncateg)]
        out = rng.choice(choices, size=n).astype(object)
        set_missing = rng.uniform(size=n) <= pct_missing
        out[set_missing] = None
        return out

    df = pd.DataFrame({
        "dt" : gen_ts_col(),
        "cat1" : gen_cat_col(),
        "cat2" : gen_cat_col(),
        "cat3" : gen_cat_col(),
        "num1" : gen_num_col(),
        "num2" : gen_num_col(),
        "num3" : gen_num_col(),
    })

    n_additional = 10
    for i in range(n_additional):
        name_num = "additional_num_" + str(i)
        name_cat = "additional_cat_" + str(i)
        name_dt = "additional_dt_" + str(i)
        df[name_num] = gen_num_col()
        df[name_cat] = gen_cat_col()
        df[name_dt] = gen_ts_col()

    df.to_parquet(f"df{file_number}.parquet")

for file_number in range(3):
    gen_large_parquet(file_number)

The query:

import duckdb

q = """
with tbl0 as (
select
    *
from read_parquet(
    '*.parquet'
    , filename=true
    , file_row_number=true
    , union_by_name=true
)
where
    dt is not null
    and dt >= '2020-04-01'
    and dt <= '2020-08-01'
    and cat1 is not null
    and num1 is not null
)
, tbl_join1 as (
    select
        cat1
        , favg("num2") as "num2_avg"
        , count(*) as num2_cnt
    from tbl0
    where
        "num2" is not null
        and not isnan("num2"::float)
    group by
        cat1
)
, tbl_join2 as (
    select
        cat3
        , favg("num3") as "num3_avg"
        , count(*) as num3_cnt
    from tbl0
    where
        "num3" is not null
        and not isnan("num3"::float)
    group by cat3
)
select
    tbl0.cat1
    , tbl0.cat3
    , cat2
    , num1
    , num3
    , num2
    , dt
    , num2_avg
    , num2_cnt
    , num3_avg
    , num3_cnt
    , row_number() over(order by filename, file_row_number) as row_num
from tbl0
join tbl_join1
    on tbl_join1.cat1 = tbl0.cat1
join tbl_join2
    on tbl_join2.cat3 = tbl0.cat3
where
    "dt" is not null
order by filename, file_row_number
""".strip()

duckdb.query("SET memory_limit='1GB';")
res = duckdb.query(q).df()
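Note that the query above uses the relative glob, which is the variant that does not crash. To trigger the segfault, point read_parquet at the absolute path of the directory containing the generated files; a sketch of that substitution (assuming q is the query string above and the files were generated in the current directory):

import os

# Swap the relative glob for its absolute form; under the 1GB limit this
# is the variant that segfaults on mac-ARM when spilling to disk.
abs_glob = os.path.join(os.getcwd(), "*.parquet")
res = duckdb.query(q.replace("'*.parquet'", f"'{abs_glob}'")).df()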

OS:

macOS-aarch64

DuckDB Version:

0.7.2-dev1146

DuckDB Client:

Python and R

Full Name:

David Cortes

Affiliation:

None

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
