-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Description
What happens?
I have large dataset that contains a deeply nested json field I am attempting to unnest. It is the ingredients column that is part of this dataset:
https://huggingface.co/datasets/openfoodfacts/product-database
I have loaded this parquet file into my database and I am attempting to unnest it with the following query:
with
parsed_ingredients as (
select code, ingredients::json[] as json_array
from "mydb"."main_raw"."raw__open_food_facts"
),
ingredient_hierarchy as (
select
code,
unnest(json_array, max_depth := 2) as ingredient_name,
generate_subscripts(json_array, 1) as ingredient_order,
from parsed_ingredients
)
select *
from ingredient_hierarchy
order by code, ingredient_order
When running it results in the following error (note I am limiting my duckdb instance to only use 8gb to run on my hardware):
Out of Memory Error: could not allocate block of size 256.0 KiB (7.4 GiB/7.4 GiB used)
I have also tried converting the field to a struct first but this causes it run out of memory before I even begin unnesting. How can i properly process this field without running OOM?
To Reproduce
Download the parquet data and run the code above to reporoduce.
OS:
Windows
DuckDB Version:
1.7.0
DuckDB Client:
DuckDB-DBT
Hardware:
No response
Full Name:
Kyle Burke
Affiliation:
None
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have