Skip to content

csv/csv.gz Files With Empty Rows Auto-Inferred As VARCHAR #14166

@Gedaria

Description

@Gedaria

What happens?

When reading multiple csv or csv.gz files say with a glob pattern, and one of those has no rows, all the columns in that file is casted as a VARCHAR. This causes a schema mismatch between that file and all other files when they have a defined type.

To Reproduce

Run this in the DuckDB CLI (or run in an in-memory database in Python)

CREATE OR REPLACE TABLE
    cte AS
SELECT
    generate_series::DATE as date_col,
    (random()*100)::INTEGER AS int_col,
    (random()*123.45)::DOUBLE AS double_col
FROM
    GENERATE_SERIES(TIMESTAMP '2000-01-01 00:00:00', TIMESTAMP '2000-12-31 00:00:00', INTERVAL 1 DAY);

COPY (
    FROM
        cte
    WHERE
        EXTRACT(YEAR FROM date_col) == 2000
) TO '__2000.csv.gz';

COPY (
    FROM
        cte
    WHERE
        EXTRACT(YEAR FROM date_col) == 2001
) TO '__2001.csv.gz';

FROM
    '__200*.csv.gz';

Error:

Invalid Input Error: Schema mismatch between globbed files.
Main file schema: __2000.csv.gz
Current file: __2001.csv.gz
Column with name: "date_col" is expected to have type: DATE But has type: VARCHAR
Column with name: "int_col" is expected to have type: BIGINT But has type: VARCHAR
Column with name: "double_col" is expected to have type: DOUBLE But has type: VARCHAR
Potential Fix: Since your schema has a mismatch, consider setting union_by_name=true.

Note that this also occurs when the format for the output file to be read again is csv, but does not occur when the output format is a parquet file.

OS:

macOS aarch64 / RHEL 7.6 Linux x86_64

DuckDB Version:

1.1.1

DuckDB Client:

CLI

Hardware:

No response

Full Name:

Kian Francis Gedaria Laredo

Affiliation:

Phitopolis International Corporation

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions