Skip to content

read_csv with autodetect hangs on bad file above certain size #16245

@robhod

Description

@robhod

What happens?

read_csv with autodetect and sa,ple size -1 hangs on malformed csv when file is above some size.

The file is malformed but I would expect it to succeed or fail with errors.

The attached files have a field containing unescaped quote and separator.
e.g. "",""Unescaped quote and embedded, seperator"",

bad_csv_file_10000.csv

bad_csv_file_2045.csv
bad_csv_file_2046.csv
bad_csv_file_2047.csv

To Reproduce

Use files attached (added larger file than hangs on my machine in case it cannot be recreated with the 2047 example)
in duckdb cli the first query succeeds, the middle query fails to detect, last query hangs


from read_csv('bad_csv_file_2045.csv',auto_detect=TRUE,  sample_size=-1);

from read_csv('bad_csv_file_2046.csv',auto_detect=TRUE,  sample_size=-1);

from read_csv('bad_csv_file_2047.csv',auto_detect=TRUE,  sample_size=-1);

OS:

MacOS

DuckDB Version:

1.2

DuckDB Client:

CLI

Hardware:

No response

Full Name:

Rob Hodgson

Affiliation:

SSC

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions