CSV Rejects Tables 2.0 #11512

pdet · 2024-04-04T15:26:33Z

This PR extends the current implementation of the CSV Rejects Tables.

The following errors can now be registered in a rejects table:

CAST
MISSING COLUMNS
TOO MANY COLUMNS
UNQUOTED VALUE
LINE SIZE OVER MAXIMUM
INVALID UNICODE

Rejected rows are recorded in two temporary tables:
reject_scans - Stores information about the CSV Scan, including configuration.
Columns:

scan_id
file_id
file_path
delimiter
quote
escape
newline_delimiter
skip_rows
has_header
columns
date_format
timestamp_format
user_arguments

reject_errors - Stores errors that happened in a given scan.
Columns:

scan_id
file_id
line: Line number where the error happened
line_byte_position: Byte position where the faulty line starts
byte_position: Byte position of the actual error
column_idx
column_name
error_type: Enum with error type
csv_line: Original CSV line where error happened
error_message: Error Message constructed in DuckDB.

There are also four parameters that control the rejects tables:

store_rejects: BOOLEAN, defines whether rejects are stored. By default they are stored in tables named reject_scans and reject_errors.
rejects_table: VARCHAR, optional name of the table where errors will be stored. By default it's reject_errors.
rejects_scans: VARCHAR, optional name of the table where scan information about the errors will be stored. By default it's reject_scans.
rejects_limit: BIGINT, optional limit on how many errors are stored.

Example:

name,age,current_day, barks
oogie boogie,3, 2023-01-01, 2
oogie boogie,3, 2023-01-02, 5
oogie boogie,3, 2023-01-03, bla, 7
oogie boogie,3, bla, bla, 7
"oogie boogie"bla,3, 2023-01-04
oogie boogie,3, bla
oogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogie,3, bla

query IIII
FROM read_csv('data/csv/rejects/multiple_errors/multiple_errors.csv',
    columns = {'name': 'VARCHAR', 'age': 'INTEGER', 'current_day': 'DATE', 'barks': 'INTEGER'},
    store_rejects = true, auto_detect=false, header = 1, max_line_size=40);
----
oogie boogie	3	2023-01-01	2
oogie boogie	3	2023-01-02	5

query IIIIIIIIIIIII rowsort
FROM reject_scans ORDER BY ALL;
----
71	0	data/csv/rejects/multiple_errors/multiple_errors.csv	,	"	"	\n	0	true	{'name': 'VARCHAR','age': 'INTEGER','current_day': 'DATE','barks': 'INTEGER'}	NULL	NULL	max_line_size='40', header=true, store_rejects=true


query IIIIIIIIII rowsort
FROM reject_errors ORDER BY ALL;
----
71	0	4	89	116	4	barks	CAST	oogie boogie,3, 2023-01-03, bla, 7	Error when converting column "barks". Could not convert string " bla" to 'INTEGER'
71	0	4	89	120	5	NULL	TOO MANY COLUMNS	oogie boogie,3, 2023-01-03, bla, 7	Expected Number of Columns: 4 Found: 5
71	0	5	124	144	4	barks	CAST	oogie boogie,3, bla, bla, 7	Error when converting column "barks". Could not convert string " bla" to 'INTEGER'
71	0	5	124	148	5	NULL	TOO MANY COLUMNS	oogie boogie,3, bla, bla, 7	Expected Number of Columns: 4 Found: 5
71	0	6	152	152	1	name	UNQUOTED VALUE	"oogie boogie"bla,3, 2023-01-04	Value with unterminated quote found.
71	0	6	152	183	3	barks	MISSING COLUMNS	"oogie boogie"bla,3, 2023-01-04	Expected Number of Columns: 4 Found: 3
71	0	7	184	203	3	barks	MISSING COLUMNS	oogie boogie,3, bla	Expected Number of Columns: 4 Found: 3
71	0	8	204	204	NULL	NULL	LINE SIZE OVER MAXIMUM	oogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogie,3, bla	Maximum line size of 40 bytes exceeded. Actual Size:92 bytes.
71	0	8	204	295	3	barks	MISSING COLUMNS	oogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogie,3, bla	Expected Number of Columns: 4 Found: 3

Also, if multiple errors occur on the same line, they are all reported. The exception is casting errors raised during the Flush method (that is, casts that are not implicit). This situation should improve as we implement more implicit casts.
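
The byte positions in the example can be cross-checked by hand. The sketch below is plain Python, not DuckDB code. It rebuilds the example file from the lines shown above, assumes `\n` line endings and UTF-8, and assumes positions are 1-based (an inference from the sample output, not something this PR states). The helper name `line_start_positions` is made up for this illustration.

```python
# Example file from the PR description, rebuilt line by line.
LINES = [
    "name,age,current_day, barks",
    "oogie boogie,3, 2023-01-01, 2",
    "oogie boogie,3, 2023-01-02, 5",
    "oogie boogie,3, 2023-01-03, bla, 7",
    "oogie boogie,3, bla, bla, 7",
    '"oogie boogie"bla,3, 2023-01-04',
    "oogie boogie,3, bla",
    "oogie boogie" * 7 + ",3, bla",
]


def line_start_positions(lines, newline="\n"):
    """Return the byte position at which each line starts (hypothetical helper)."""
    positions = []
    offset = 1  # assumption: positions are 1-based, as the sample output suggests
    for line in lines:
        positions.append(offset)
        # Advance by the encoded size of the line plus its line terminator.
        offset += len((line + newline).encode("utf-8"))
    return positions


starts = line_start_positions(LINES)
for line_number, start in enumerate(starts, start=1):
    size = len((LINES[line_number - 1] + "\n").encode("utf-8"))
    print(line_number, start, size)
```

Under those assumptions the starts of lines 4 to 8 come out as 89, 124, 152, 184 and 204, matching line_byte_position in the output above, and the last line measures 92 bytes including its newline, matching the LINE SIZE OVER MAXIMUM message.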


darthf1 · 2024-04-05T10:26:01Z

This is awesome!

Would it be possible to add something like MISSING VALUES? Where a column which does exist, but with a NULL value, will error out?

I'm currently doing CSV validation via JSON schema, where I check datatypes, missing columns, and missing values.

Reading this PR description I can replace it almost fully, in my usecase at least, with the rejects table in duckdb which will be a lot more performant.

But I recognize that NULL values have nothing to do with a valid / invalid CSV structure :)

Mytherin

Thanks for the PR! Looks good. Some comments below:

src/catalog/catalog.cpp

third_party/utf8proc/utf8proc_wrapper.cpp

test/sql/copy/csv/rejects/csv_rejects_two_tables.test

src/storage/serialization/serialize_nodes.cpp

src/include/duckdb/storage/serialization/nodes.json

Mytherin · 2024-04-05T10:45:41Z

src/include/duckdb/storage/serialization/nodes.json

-        "name": "rejects_table_name",
-        "type": "string"
+        "name": "store_rejects",
+        "type": "CSVOption<bool>"


We can't change the type of a field unless the serialization of the two types is identical.

src/include/duckdb/storage/serialization/nodes.json

src/execution/operator/csv_scanner/util/csv_error.cpp

pdet · 2024-04-06T08:26:03Z

> This is awesome!
>
> Would it be possible to add something like MISSING VALUES? Where a column which does exist, but with a NULL value, will error out?
>
> I'm currently doing CSV validation via JSON schema, where I check datatypes, missing columns, and missing values.
>
> Reading this PR description I can replace it almost fully, in my usecase at least, with the rejects table in duckdb which will be a lot more performant.
>
> But I recognize that NULL values have nothing to do with a valid / invalid CSV structure :)

Absolutely. I think that's a bit orthogonal to this PR, but I'm happy to add something in this direction in the near-future. :-)

Mytherin · 2024-04-11T14:22:52Z

Thanks!

Merge pull request duckdb/duckdb#11512 from pdet/rejects_tables_2.0

adriens · 2024-04-11T22:46:19Z

👏 Awesome 🤩

aborruso · 2024-04-13T06:48:52Z

@pdet first of all thank you very much, it's really useful, it's really a great job.

I don't understand one of the errors in your example, LINE SIZE OVER MAXIMUM: "Maximum line size of 40 bytes exceeded. Actual Size:92 bytes".

Is there a limit? And where is this limit set?

Thank you again

aborruso · 2024-04-13T06:56:06Z

> Is there a limit? And where is this limit set?

I'm stupid :)

We have max_line_size=40

aborruso · 2024-04-13T07:07:08Z

Hi @pdet I have tested it and I have some notes.

My input file:

nome,compleanno,altezza
Maurizio,1992-12-27,187
Paola,gennaio,162
Andy,1973-07-06,176,Palermo

Chiara,1991-02-02,162

If I run

FROM read_csv('tmp.csv',columns = {'nome': 'VARCHAR', 'compleanno': 'DATE', 'altezza': 'INTEGER'}, store_rejects = true, auto_detect=false, header = 1);

I have the below errors. My notes:

In the error log I see a CAST error reported on line 5. That doesn't seem right to me: the cast error is on line 3.
When reading the file, it correctly skips line 5 (the empty line), but I cannot find a matching error in the log. How about introducing an empty row error?

scan_id	file_id	line	line_byte_position	byte_position	column_idx	column_name	error_type	csv_line	error_message
7	0	4	67	86	4		TOO MANY COLUMNS	Andy,1973-07-06,176,Palermo	Expected Number of Columns: 3 Found: 4
7	0	5	49		2	compleanno	CAST	Paola,gennaio,162	Error when converting column "compleanno". date field value out of range: "gennaio", expected format is (YYYY-MM-DD)
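
One way to cross-check which line a reported row belongs to is to map its line_byte_position back to a line number. The sketch below does this in plain Python for the input file above, assuming `\n` line endings, UTF-8 and 1-based positions (inferred from the output, not confirmed in this thread). `line_starts` and `line_number_at` are hypothetical helper names, not DuckDB functions.

```python
import bisect

# Input file from this comment, rebuilt line by line (line 5 is empty).
LINES = [
    "nome,compleanno,altezza",
    "Maurizio,1992-12-27,187",
    "Paola,gennaio,162",
    "Andy,1973-07-06,176,Palermo",
    "",
    "Chiara,1991-02-02,162",
]


def line_starts(lines, newline="\n"):
    """Byte position at which each line starts (assumed 1-based)."""
    starts, offset = [], 1
    for line in lines:
        starts.append(offset)
        offset += len((line + newline).encode("utf-8"))
    return starts


def line_number_at(byte_position, starts):
    """1-based number of the line that contains the given byte position."""
    # bisect_right counts how many lines start at or before the position.
    return bisect.bisect_right(starts, byte_position)


STARTS = line_starts(LINES)
print(line_number_at(49, STARTS))  # line_byte_position of the CAST row -> 3
print(line_number_at(67, STARTS))  # line_byte_position of the TOO MANY COLUMNS row -> 4
```

Under these assumptions byte position 49 falls on line 3 and 67 on line 4, so the byte positions in the log point at the expected lines even though the line column of the CAST row says 5.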

aborruso · 2024-04-13T07:13:16Z

@pdet I'm adding another note.

If I have this

nome,compleanno,altezza
Maurizio,1992-12-27,187
Paola,gennaio,162
Andy,1973-07-06,176,Palermo

Chiara,02-02,162
Mario

I have a cast error in row 7: "Error when converting column ""compleanno"". date field value out of range: ""gennaio"", expected format is (YYYY-MM-DD)"
That cast error is in row 6.

pdet added 30 commits February 21, 2024 11:05

Get rid of rejects_recovery_columns

0357a33

pesky bee

62d8dec

wip commit

e7bfcd6

wip

bf320b4

Merge branch 'main' into rejects_tables_2.0

2d761bd

Enum for CSV Errors, cleaning up output

ba93182

Several tweaks for the tables

551865f

We can also store the global byte where an error shows up

5eae50f

Fixing up old tests and fixing small bugs

5951413

remove old parameter, cleanup of older tests

a8f2dcd

Handle CSV Line Errors that fall over multiple buffers

bcebeff

rounding off minor details

8ab1a9a

lots of adjustments to make the errors accurate for small buffer sizes

b6fd567

When resetting the buffer_handle we might have to keep the last one

7ad39f2

fix test

2826f95

We only care about propagting errors if we are ignoring them, as weird as this sounds

f6496fc

woopsie-doopsie

404c4da

wip on rejects from flush cast

ecf76d4

introduce FullLinePosition

53f612b

Errors during flush being properly propagated

4403d16

preparing the ground for the other errors

7129aaf

Merge remote-tracking branch 'origin/main' into rejects_tables_2.0

7d03e73

All rejects tests pass with vector_size=2

7cecc4b

WIP on column amount incorrect

8222dd4

merge

dce4f61

remainnig merge

d9ffcb9

line error fix

d54bf5b

Get information on too many columns right

67703eb

More tests for different incorrect column amounts

12542a2

WIP on sanitizing invalid utfs and more on utf rejects tables

8ea5f0a

github-actions bot marked this pull request as draft April 5, 2024 09:17

pdet marked this pull request as ready for review April 5, 2024 09:25

Mytherin reviewed Apr 5, 2024


Mytherin added the Changes Requested label Apr 5, 2024

pdet added 8 commits April 8, 2024 10:59

Remove extra catalog function

3b561cb

more pr requests, adding tests and assertions

df3d7ff

Also print error_line when throwing

7e07831

Current File Index

2f7be72

Maybe this is fine for the serializer?

88454e4

Go away utility

2fb9648

some special code for options

3870eda

One more test still having scan_ids

41cd77a

Mytherin marked this pull request as draft April 9, 2024 11:04

Mytherin marked this pull request as ready for review April 9, 2024 11:04

pdet added Ready For Review and removed Changes Requested labels Apr 11, 2024

merging

bbe0eaf

duckdb-draftbot marked this pull request as draft April 11, 2024 09:35

pdet marked this pull request as ready for review April 11, 2024 09:48

Mytherin merged commit 41419f3 into duckdb:main Apr 11, 2024

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Apr 11, 2024

chore: Update vendored sources to duckdb/duckdb@41419f3

457620e

Merge pull request duckdb/duckdb#11512 from pdet/rejects_tables_2.0

pdet deleted the rejects_tables_2.0 branch June 25, 2024 09:33
