What happens?
I have a Parquet file with a nested struct column. When I query this file and extract the struct column, the query appears to return the wrong data. This happens when the input is an Arrow Dataset or Table.
I have uploaded the specific file in question below. The nested aspect might be a coincidence; I have no idea what's going wrong.
To Reproduce
Dataset (zipped parquet file - could not upload parquet alone):
obfuscated.zip
Environment
Package Version
--------------- ------------
duckdb 0.6.1.dev191
numpy 1.24.0rc1
pandas 1.5.2
pip 22.3.1
pyarrow 10.0.1
python-dateutil 2.8.2
pytz 2022.6
setuptools 65.5.1
six 1.16.0
wheel 0.37.1
Python code
from IPython.display import display  # built in to notebooks; needed elsewhere
import duckdb
import pyarrow.dataset as ds

# The SQL below refers to this Python variable by name (`FROM data`).
data = ds.dataset("~/obfuscated.parquet")
con = duckdb.connect()

# Case 1: select every row in DuckDB, then filter in pandas.
display("Select then filter")
display(con.execute("""
    SELECT
        col1,
        col2,
        col3,
        col4,
        col5,
        nested_col.*,
    FROM data
    ORDER BY ALL
""").df().loc[lambda df: (df.col1 == 749) & (df.col2 == 747) & (df.col3 == 5) & (df.col4 == 1)])

# Case 2: apply the same filter in SQL, then select.
display("Filter then select")
display(con.execute("""
    SELECT
        col1,
        col2,
        col3,
        col4,
        col5,
        nested_col.*,
    FROM data
    WHERE col1 = 749
        AND col2 = 747
        AND col3 = 5
        AND col4 = 1
    ORDER BY ALL
""").df())

con.close()
Output
The expected output is the second one above ("Filter then select"). The issue only appears with Arrow Dataset/Table inputs. It does not appear when a pandas DataFrame is used as the input.
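The comparison between the two outputs could also be automated. The following is only a rough sketch, not part of the original reproduction: the helper names same_rows and filter_paths_agree are made up for illustration, the use of con.register to expose the input under the name data is an assumption about how one might wire it up, and the sort assumes the expanded nested_col fields are plain sortable values.

```python
import pandas as pd


def same_rows(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """True when both frames have the same columns and rows, ignoring row order."""
    if list(left.columns) != list(right.columns):
        return False
    cols = list(left.columns)
    left = left.sort_values(cols).reset_index(drop=True)
    right = right.sort_values(cols).reset_index(drop=True)
    return left.equals(right)


def filter_paths_agree(data) -> bool:
    """Filter once in pandas and once in SQL, then compare the two results.

    `data` is the object under test: an Arrow dataset/table or a pandas DataFrame.
    """
    import duckdb  # local import: only this function needs DuckDB

    select = "SELECT col1, col2, col3, col4, col5, nested_col.* FROM data"
    where = "col1 = 749 AND col2 = 747 AND col3 = 5 AND col4 = 1"

    con = duckdb.connect()
    try:
        con.register("data", data)  # expose the input to SQL as `data`
        everything = con.execute(f"{select} ORDER BY ALL").df()
        filtered_in_sql = con.execute(f"{select} WHERE {where} ORDER BY ALL").df()
    finally:
        con.close()

    # Same predicate as `where`, applied on the pandas side.
    mask = (
        (everything.col1 == 749)
        & (everything.col2 == 747)
        & (everything.col3 == 5)
        & (everything.col4 == 1)
    )
    return same_rows(everything[mask], filtered_in_sql)
```

Calling filter_paths_agree(ds.dataset("~/obfuscated.parquet")) and then calling it again on the same data converted to a pandas DataFrame would, if the report above holds, return False for the Arrow input and True for the pandas input.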
OS:
Linux x64
DuckDB Version:
duckdb-0.6.1.dev191
DuckDB Client:
Python
Full Name:
Edward Davis
Affiliation:
Veitch Lister Consulting
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree