-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Description
What happens?
I have a CSV file where sometimes non-quotes (i.e. commas) are escaped using a backslash. If then data occurs that has an escaped quote (i.e. "Hello \"World\"")
the CSV sniffer gets confused. If I provide the correct format, it will error out.
To Reproduce
Do a read_csv
(or sniff_csv
) on a CSV file with this content:
"foo","bar"
"332", "Surname\, Firstname"
"123","foo (\"foo\")"
The results are:
delim = |
quote = '
escape = \
If I now switch the rows, like this:
"foo","bar"
"123","foo (\"foo\")"
"332", "Surname\, Firstname"
The results are:
delim = ,
quote = '
escape = \
If I remove the backslash from that last line:
"foo","bar"
"123","foo (\"foo\")"
"332", "Surname, Firstname"
The results become correct, i.e.:
delim = ,
quote = "
escape = \
The CSV file I have seems to be escaping commas and that seems to set off the sniffer.
The problem I have is if I provide the correct settings to read_cvs
it errors out with:
D select * from read_csv('fuzz.csv', delim=',', quote='"', escape='\');
Invalid Input Error: Error when sniffing file "fuzz.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ','
Quote/Escape Candidates: ['"','"'],['"','\0'],['"',''']
Comment Candidates: '#', '\0'
Possible fixes:
* Delimiter is set to ','. Consider unsetting it.
* Quote is set to '"'. Consider unsetting it.
* Escape is set to '\'. Consider unsetting it.
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd')
I suspect the problem has to do with escaped constructs such as foo\, bar
.
Is this a bug or is it supposed to be like this?
OS:
Linux
DuckDB Version:
v1.1.1 af39bd0
DuckDB Client:
CLI
Hardware:
No response
Full Name:
Helgi Kristjansson
Affiliation:
Roaring Group AB
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have not tested with any build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have