Description
What happens?
This is a minimal reproduction, but it happens naturally in our production data. If a column value on row 1 starts with a #, another row contains a quoted value, and a column value on row 2048 contains another #, DuckDB fails to parse the CSV. In the production data the offending values are not on rows 1 and 2048, but when narrowing down the rows causing the issue it appeared that the file had to be at least 2048 rows long to trigger it.
NOTE: This seems to be related to the comment parameter. If I set comment='' in my read_csv, the query works. The docs say the comment parameter defaults to (empty), and setting it to any character other than # also works fine, but setting it to # reproduces the issue. So for some reason the sniffer appears to be using # as the comment character even though it is not set.
Error Message
select * from read_csv('~/downloads/test.csv', strict_mode=false);
Invalid Input Error:
Error when sniffing file "/Users/thomas.boles/downloads/test.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ',', '|', ';', ' '
Quote/Escape Candidates: ['(no quote)','(no escape)'],['"','(no escape)'],['"','"'],['"','''],['"','\'],[''','(no escape)'],[''','''],[''','"'],[''','\']
Comment Candidates: '\0', '#'
Encoding: utf-8
Possible fixes:
* Make sure you are using the correct file encoding. If not, set it (e.g., encoding = 'utf-16').
* Set delimiter (e.g., delim=',')
* Set quote (e.g., quote='"')
* Set escape (e.g., escape='"')
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd')
* Be sure that the maximum line size is set to an appropriate value, otherwise set it (e.g., max_line_size=10000000)
To Reproduce
Add the following to iss17744.sql and run it with duckdb -bail -f iss17744.sql:
COPY (
SELECT
-- The string must start with a hash to reproduce the issue
'#hash start' AS first_column,
1 AS second_column
UNION ALL
SELECT
  -- The quoted value can go on any row strictly between rows 1 and 2048.
  -- It must sit between the two hash values.
'"my, quoted value"' AS first_column,
1 AS second_column
UNION ALL
-- These filler rows bring the CSV to 2048 rows, which is required to reproduce
SELECT
'any value' AS first_column,
1 AS second_column
FROM range(0, 2045)
UNION ALL
SELECT
  -- The hash must appear somewhere in the string, just not at the beginning
  'hash not at start #' AS first_column,
1 AS second_column
) TO 'test.csv' (format csv, header 1);
-- This errors with or without strict mode, with or without quote='"'
select * from read_csv('test.csv', strict_mode=false);
-- Setting comment to the empty string '' works. It seems the sniffer is automatically choosing '#' as the comment character, which causes the issue.
select * from read_csv('test.csv', strict_mode=false, comment='');
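For reference, the same CSV can be generated without DuckDB. This is a sketch in plain Python mirroring the COPY statement above (row counts and values taken from the SQL repro), assuming Python's csv module quotes the comma-containing value the same way DuckDB's default CSV writer does:

```python
# Sketch: build the 2048-row repro CSV from the SQL COPY above.
import csv

rows = [("#hash start", 1)]              # row 1: value starting with '#'
rows.append(('"my, quoted value"', 1))   # row 2: quoted value between the hashes
rows += [("any value", 1)] * 2045        # filler rows 3..2047
rows.append(("hash not at start #", 1))  # row 2048: '#' not at the start

with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["first_column", "second_column"])
    writer.writerows(rows)
```

If the quoting matches, reading this file with read_csv should hit the same sniffing error on 1.3.0.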
OS:
macOS 15.5 arm64
DuckDB Version:
1.3.0
DuckDB Client:
Java and Shell
Hardware:
No response
Full Name:
Thomas Boles
Affiliation:
Finqore
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have not tested with any build
Did you include all relevant data sets for reproducing the issue?
No - Other reason (please specify in the issue body)
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have