Bug when reading csv with # and quoted values using read_csv #17744

@tcboles

Description

What happens?

This is a minimal reproduction, but the issue occurs naturally in our production data. If the first column of row 1 starts with a #, a later row contains a quoted value, and another column on row 2048 contains another #, DuckDB fails to parse the CSV. In the production data the offending rows were not rows 1 and 2048, but while narrowing down the rows causing the issue it appeared that the file needed at least 2048 rows to trigger it.

NOTE: It seems to be related to the comment parameter. If I set comment='' in my read_csv, the query works. The docs say the comment parameter defaults to (empty), yet setting it to anything other than '#' works fine, while setting it to '#' reproduces the issue. So for some reason the sniffer appears to be using '#' as the comment character even though it is not set.

Error Message

select * from read_csv('~/downloads/test.csv', strict_mode=false);
Invalid Input Error:
Error when sniffing file "/Users/thomas.boles/downloads/test.csv".
It was not possible to automatically detect the CSV Parsing dialect/types
The search space used was:
Delimiter Candidates: ',', '|', ';', '	'
Quote/Escape Candidates: ['(no quote)','(no escape)'],['"','(no escape)'],['"','"'],['"','''],['"','\'],[''','(no escape)'],[''','''],[''','"'],[''','\']
Comment Candidates: '\0', '#'
Encoding: utf-8
Possible fixes:
* Make sure you are using the correct file encoding. If not, set it (e.g., encoding = 'utf-16').
* Set delimiter (e.g., delim=',')
* Set quote (e.g., quote='"')
* Set escape (e.g., escape='"')
* Set comment (e.g., comment='#')
* Set skip (skip=${n}) to skip ${n} lines at the top of the file
* Enable ignore errors (ignore_errors=true) to ignore potential errors
* Enable null padding (null_padding=true) to pad missing columns with NULL values
* Check you are using the correct file compression, otherwise set it (e.g., compression = 'zstd')
* Be sure that the maximum line size is set to an appropriate value, otherwise set it (e.g., max_line_size=10000000)

To Reproduce

Add the following to iss17744.sql and run it with duckdb -bail -f iss17744.sql:

COPY (
    SELECT
        -- The string must start with a hash to reproduce the issue
        '#hash start' AS first_column,
        1 AS second_column

    UNION ALL

    SELECT
        -- The quoted value can appear on any row strictly between rows 1 and 2048
        -- (not row 1 or row 2048 itself); it must fall between the two hashes
        '"my, quoted value"' AS first_column,
        1 AS second_column

    UNION ALL

    -- These rows pad the CSV to the 2048 rows required to reproduce
    SELECT
        'any value' AS first_column,
        1 AS second_column
    FROM range(0, 2045)

    UNION ALL

    SELECT
        -- The hash must appear somewhere in this string, just not at the beginning
        'hash not at start #' AS first_column,
        1 AS second_column
) TO 'test.csv' (format csv, header 1);

-- This errors with or without strict mode, with or without quote='"'
select * from read_csv('test.csv', strict_mode=false);

-- Setting comment to an empty string '' works. It seems that the config is automatically setting comment to '#' which is causing the issue.
select * from read_csv('test.csv', strict_mode=false, comment='');
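For anyone without the DuckDB CLI at hand, the same test.csv can be generated with a short stdlib-only Python sketch (this is my approximation of the layout the COPY above produces, not the issue author's script; the file shape is what matters: '#' at the start of row 1, a quoted value in between, '#' elsewhere on row 2048):

```python
import csv

# Row 1: value starts with '#' (candidate comment character).
rows = [("#hash start", 1)]
# A value containing quotes and a comma, somewhere between the two hashes;
# csv.writer quotes it automatically because of the comma and quotes.
rows.append(('"my, quoted value"', 1))
# Padding so the file reaches 2048 data rows in total.
rows.extend([("any value", 1)] * 2045)
# Row 2048: '#' present, but not at the start of the value.
rows.append(("hash not at start #", 1))

with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["first_column", "second_column"])
    writer.writerows(rows)
```

Reading the resulting file with read_csv('test.csv') should then hit the same sniffer error.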

OS:

Mac OS 15.5 arm64

DuckDB Version:

1.3.0

DuckDB Client:

Java and Shell

Hardware:

No response

Full Name:

Thomas Boles

Affiliation:

Finqore

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
