selecting struct column using parquet_scan returns varchar instead of struct #3036

@pdutta777

What happens?

When I read a nested (struct) column from a Parquet file using the parquet_scan function, the data is returned as a varchar. If I instead read the same Parquet file into an Arrow table and query the Arrow table, the proper struct is returned.

To Reproduce

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
from os.path import exists

def create_parquet():
    sch = pa.schema([
        pa.field("col1", pa.string()),
        pa.field("col2", pa.struct([
            pa.field("nested", pa.string()),
        ]))
    ])

    data={
        "col1": ["A", "B", "C"],
        "col2": [{"nested": "A"}, {"nested": "B"}, {"nested": "C"}]
    }

    tbl = pa.table(data, schema=sch)
    pq.write_table(table=tbl, where="testdata.parquet")


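if __name__ == '__main__':
    if not exists("testdata.parquet"):
        create_parquet()

    # Case 1: query the Parquet file directly; col2 comes back as a varchar
    con = duckdb.connect()
    print(con.execute("select col2 from parquet_scan('testdata.parquet')").fetchall())

    # Case 2: load the same file into an Arrow table first; col2 comes back as a struct
    arrow = duckdb.arrow(pq.read_table("testdata.parquet"))
    print(arrow.query('arrow', "select col2 from arrow").fetchall())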
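To make the difference between the two results easier to see than by reading the printed tuples, here is a minimal sketch of a checker for the rows returned by fetchall(). The helper name classify_first_column is hypothetical and not part of DuckDB. It assumes that the Python client materializes a STRUCT value as a dict and a VARCHAR value as a str, and the sample rows are hand-written for illustration, not output captured from DuckDB.

```python
from typing import Any, Iterable, Tuple


def classify_first_column(rows: Iterable[Tuple[Any, ...]]) -> str:
    """Classify the first column of fetched rows by its Python value type.

    Assumption: a STRUCT value arrives as a dict and a VARCHAR value as a str.
    """
    kinds = set()
    for row in rows:
        value = row[0]
        if value is None:
            continue  # NULLs carry no type information
        if isinstance(value, dict):
            kinds.add("struct")
        elif isinstance(value, str):
            kinds.add("varchar")
        else:
            kinds.add(type(value).__name__)
    if not kinds:
        return "unknown"
    if len(kinds) == 1:
        return kinds.pop()
    return "mixed"


# Hand-written sample rows (hypothetical, not captured from DuckDB)
struct_rows = [({"nested": "A"},), ({"nested": "B"},)]
string_rows = [("{'nested': 'A'}",), ("{'nested': 'B'}",)]
print(classify_first_column(struct_rows))
print(classify_first_column(string_rows))
```

In the reproduction script, the same helper could be applied to each fetchall() result in place of the bare print calls, so the two code paths report "varchar" and "struct" explicitly.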

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • DuckDB Version: 0.3.1
  • DuckDB Client: python-3.8, arrow-6.0.1

Before Submitting

  • Have you tried this on the latest master branch? No
  • Python: pip install duckdb --upgrade --pre
  • R: install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
  • Other Platforms: You can find binaries here or compile from source.
  • [x] Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
