Skip to content

CSV sniff gets confused by escaped non-quote characters and seems to ignore parameters #14304

@djupudga

Description

@djupudga

What happens?

I have a CSV file where sometimes non-quotes (i.e. commas) are escaped using a backslash. If then data occurs that has an escaped quote (i.e. "Hello \"World\"") the CSV sniffer gets confused. If I provide the correct format, it will error out.

To Reproduce

Do a read_csv (or sniff_csv) on a CSV file with this content:

"foo","bar"
"332", "Surname\, Firstname"
"123","foo (\"foo\")"

The results are:

delim = |
quote = '
escape = \

If I now switch the rows, like this:

"foo","bar"
"123","foo (\"foo\")"
"332", "Surname\, Firstname"

The results are:

delim = ,
quote = '
escape = \

If I remove the backslash from that last line:

"foo","bar"
"123","foo (\"foo\")"
"332", "Surname, Firstname"

The results become correct, i.e.:

delim = ,
quote = "
escape = \

The CSV file I have seems to be escaping commas and that seems to set off the sniffer.

The problem I have is if I provide the correct settings to read_cvs it errors out with:

D select * from read_csv('fuzz.csv', delim=',', quote='"', escape='\');
Invalid Input Error: Error when sniffing file "fuzz.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ','
Quote/Escape Candidates: ['"','"'],['"','\0'],['"',''']
Comment Candidates: '#', '\0'
Possible fixes:
* Delimiter is set to ','. Consider unsetting it.
* Quote is set to '"'. Consider unsetting it.
* Escape is set to '\'. Consider unsetting it.
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd')

I suspect the problem has to do with escaped constructs such as foo\, bar.

Is this a bug or is it supposed to be like this?

OS:

Linux

DuckDB Version:

v1.1.1 af39bd0

DuckDB Client:

CLI

Hardware:

No response

Full Name:

Helgi Kristjansson

Affiliation:

Roaring Group AB

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions