-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Description
What happens?
When using the CSV reader in version 1.1.1
, valid rows from the CSV will be deleted when there are erroneous newlines present in the CSV surrounding the valid records. This occurs when you use the ignore_errors
parameter. If you set ignore_errors
to False
or use store_rejects
, then no data loss occurs.
I believe this is correlated to CSVs with a large amount of records. During my testing, as I've reduced the amount of records in my test CSV the amount of dataloss decreased as the amount of records decreased. I'll provide my sample data below.
This issue first began happening in version 1.1.0
, and this is not an issue in version 1.0.0
.
To Reproduce
First, download the sample data here: test_duckdb_data.csv.zip
Using this CSV, we'll create two views in DuckDB:
CREATE VIEW test_1 AS
SELECT *
FROM read_csv_auto('test_duckdb_data.csv',
header=True,
ignore_errors=True,
delim=',',
all_varchar=True
);
test_1
is ignoring errors.
CREATE VIEW test_2 AS
SELECT *
FROM read_csv_auto('test_duckdb_data.csv',
header=True,
ignore_errors=True,
delim=',',
all_varchar=True,
store_rejects=True
);
test_2
is storing rejects (or you can set ignore_errors=False
).
Once these two views have been loaded from the same data, you can test the amount of records for both. You should see a discrepancy:
WITH view_counts AS (
SELECT
(SELECT COUNT(*) FROM test_1) AS test_1_count,
(SELECT COUNT(*) FROM test_2) AS test_2_count
)
SELECT
test_1_count,
test_2_count,
ABS((test_2_count - test_1_count)) AS difference
FROM
view_counts;
And you can looking at the records that exist in test_2
but not in test_1
:
SELECT *
FROM test_2
EXCEPT
SELECT *
FROM test_1;
OS:
macOS 14.6.1, Apple M1 (AArch64)
DuckDB Version:
1.1.1
DuckDB Client:
Python
Hardware:
No response
Full Name:
Phillip Wright
Affiliation:
Shipt
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have