Skip to content

CSV reader will delete valid records when erroneous newlines are present #14177

@phillipwright7

Description

@phillipwright7

What happens?

When using the CSV reader in version 1.1.1, valid rows from the CSV will be deleted when there are erroneous newlines present in the CSV surrounding the valid records. This occurs when you use the ignore_errors parameter. If you set ignore_errors to False or use store_rejects, then no data loss occurs.

I believe this is correlated to CSVs with a large amount of records. During my testing, as I've reduced the amount of records in my test CSV the amount of dataloss decreased as the amount of records decreased. I'll provide my sample data below.

This issue first began happening in version 1.1.0, and this is not an issue in version 1.0.0.

To Reproduce

First, download the sample data here: test_duckdb_data.csv.zip

Using this CSV, we'll create two views in DuckDB:

CREATE VIEW test_1 AS
SELECT *
FROM read_csv_auto('test_duckdb_data.csv',
    header=True,
    ignore_errors=True,
    delim=',',
    all_varchar=True
    );

test_1 is ignoring errors.

CREATE VIEW test_2 AS
SELECT *
FROM read_csv_auto('test_duckdb_data.csv',
    header=True,
    ignore_errors=True,
    delim=',',
    all_varchar=True,
    store_rejects=True
    );

test_2 is storing rejects (or you can set ignore_errors=False).

Once these two views have been loaded from the same data, you can test the amount of records for both. You should see a discrepancy:

WITH view_counts AS (
    SELECT
        (SELECT COUNT(*) FROM test_1) AS test_1_count,
        (SELECT COUNT(*) FROM test_2) AS test_2_count
    )
SELECT
    test_1_count,
    test_2_count,
    ABS((test_2_count - test_1_count)) AS difference
FROM
    view_counts;

And you can looking at the records that exist in test_2 but not in test_1:

SELECT *
FROM test_2
EXCEPT
SELECT *
FROM test_1;

OS:

macOS 14.6.1, Apple M1 (AArch64)

DuckDB Version:

1.1.1

DuckDB Client:

Python

Hardware:

No response

Full Name:

Phillip Wright

Affiliation:

Shipt

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions