Incorrect nested data converted in Arrow <-> DuckDB conversion when read through Arrow DataSet (Python) #5547

@edavisau

What happens?

I have a Parquet file with a nested struct column. When I query this file and extract the struct column, the query appears to return the wrong data. This happens when the input is an Arrow Dataset or Table.

I have uploaded the specific file in question below. The nested aspect might be a coincidence - I have no idea what's going wrong.

To Reproduce

Dataset (zipped parquet file - could not upload parquet alone):
obfuscated.zip

Environment

Package         Version
--------------- ------------
duckdb          0.6.1.dev191
numpy           1.24.0rc1
pandas          1.5.2
pip             22.3.1
pyarrow         10.0.1
python-dateutil 2.8.2
pytz            2022.6
setuptools      65.5.1
six             1.16.0
wheel           0.37.1

Python code

import duckdb
import pyarrow.dataset as ds
from IPython.display import display  # implicit in Jupyter; only needed when run as a script

data = ds.dataset("~/obfuscated.parquet")

con = duckdb.connect()

# Query every row, then filter the resulting DataFrame in pandas.
display("Select then filter")
display(con.execute("""
SELECT
    col1,
    col2,
    col3,
    col4,
    col5,
    nested_col.*,
FROM data
ORDER BY ALL
""").df().loc[lambda df: (df.col1 == 749) & (df.col2 == 747) & (df.col3 == 5) & (df.col4 == 1)])

# Apply the same filter inside the SQL query instead.
display("Filter then select")
display(con.execute("""
SELECT
    col1,
    col2,
    col3,
    col4,
    col5,
    nested_col.*,
FROM data
WHERE col1 = 749
    AND col2 = 747
    AND col3 = 5
    AND col4 = 1
ORDER BY ALL
""").df())

con.close()

Output

[image: screenshot of the two query outputs]

The second output ("Filter then select") is the expected one; the first output ("Select then filter") should match it but does not. The issue only appears with Arrow Dataset/Table inputs. It does not appear when the input is a pandas DataFrame.
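
The difference between the two outputs can also be checked programmatically rather than by eye. The following is only a minimal sketch: it assumes each result has already been converted to a list of row tuples with hashable values, and both the helper name `diff_rows` and the sample rows are hypothetical (they are not taken from the attached file).

```python
from collections import Counter


def diff_rows(expected, actual):
    """Return (missing, unexpected) rows, comparing the results as multisets."""
    exp = Counter(map(tuple, expected))
    act = Counter(map(tuple, actual))
    # Counter subtraction keeps only positive counts, so duplicates are respected.
    missing = sorted((exp - act).elements(), key=repr)
    unexpected = sorted((act - exp).elements(), key=repr)
    return missing, unexpected


# Hypothetical rows: (col1, col2, col3, col4, col5, one nested field)
filter_then_select = [(749, 747, 5, 1, 10, "a"), (749, 747, 5, 1, 11, "b")]
select_then_filter = [(749, 747, 5, 1, 10, "a"), (749, 747, 5, 1, 11, "x")]

missing, unexpected = diff_rows(filter_then_select, select_then_filter)
print("missing:", missing)
print("unexpected:", unexpected)
```

With the reproduction above, `expected` would be the rows of the "Filter then select" query and `actual` the filtered rows of the "Select then filter" query, for example via `list(df.itertuples(index=False, name=None))` on each DataFrame. Two empty lists mean the results agree.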

OS:

Linux x64

DuckDB Version:

duckdb-0.6.1.dev191

DuckDB Client:

Python

Full Name:

Edward Davis

Affiliation:

Veitch Lister Consulting

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
