Closed
Description
What happens?
When I read a nested (struct) column from a Parquet file using the parquet_scan function, the data is returned as a varchar. If I instead read the Parquet file into an Arrow table and query that table, the proper struct is returned.
To Reproduce
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
from os.path import exists

def create_parquet():
    sch = pa.schema([
        pa.field("col1", pa.string()),
        pa.field("col2", pa.struct([
            pa.field("nested", pa.string()),
        ]))
    ])
    data = {
        "col1": ["A", "B", "C"],
        "col2": [{"nested": "A"}, {"nested": "B"}, {"nested": "C"}]
    }
    tbl = pa.table(data, schema=sch)
    pq.write_table(table=tbl, where="testdata.parquet")

if __name__ == '__main__':
    if not exists("testdata.parquet"):
        create_parquet()
    con = duckdb.connect()
    print(con.execute("select col2 from parquet_scan('testdata.parquet')").fetchall())
    arrow = duckdb.arrow(pq.read_table("testdata.parquet"))
    print(arrow.query('arrow', "select col2 from arrow").fetchall())
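To make the difference between the two printed results easier to spot, a small helper along these lines could be applied to each fetchall() result. This is only a sketch: summarize_rows is a hypothetical name (not part of DuckDB or PyArrow), the sample rows are hand-written placeholders rather than real DuckDB output, and it assumes a struct value reaches Python as a dict while a varchar reaches Python as a str.

```python
def summarize_rows(rows):
    """Return the Python type name of the first column in each fetched row."""
    return [type(row[0]).__name__ for row in rows]


# Hand-written placeholder rows standing in for the two fetchall() results.
# These are illustrative only and are NOT captured DuckDB output.
as_varchar = [("{'nested': 'A'}",)]   # assumption: varchar arrives as str
as_struct = [({"nested": "A"},)]      # assumption: struct arrives as dict

print(summarize_rows(as_varchar))
print(summarize_rows(as_struct))
```

In the reproduction script, the same helper could be called on the result of each fetchall() to show at a glance which path yields strings and which yields structs.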
Environment (please complete the following information):
- OS: Ubuntu 20.04
- DuckDB Version: 0.3.1
- DuckDB Client: python-3.8, arrow-6.0.1
Before Submitting
- Have you tried this on the latest master branch? : No
- Python:
pip install duckdb --upgrade --pre
- R:
install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
- Other Platforms: You can find binaries here or compile from source.
- [x] Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?