Skip to content

Conversation

pdet
Copy link
Contributor

@pdet pdet commented Apr 4, 2024

This PR extends the current implementation of the CSV Rejects Tables.

The following errors can now be registered in a rejects table:

  1. CAST
  2. MISSING COLUMNS
  3. TOO MANY COLUMNS
  4. UNQUOTED VALUE
  5. LINE SIZE OVER MAXIMUM
  6. INVALID UNICODE

They will also produce two temporary tables.
reject_scans - Stores information about the CSV Scan, including configuration.
Columns:

  1. scan_id
  2. file_id
  3. delimiter
  4. quote
  5. escape
  6. newline_delimiter
  7. skip_rows
  8. has_header
  9. columns
  10. date_format
  11. timestamp_format
  12. user_arguments

reject_errors - Stores errors that happened in a given scan.
Columns:

  1. scan_id
  2. file_id
  3. line : Line number where error happened
  4. line_byte_position: Byte position where faulty line starts
  5. byte_positon: Byte position of the actual error
  6. column_idx
  7. column_name
  8. error_type: Enum with error type
  9. csv_line: Original CSV line where error happened
  10. error_message: Error Message constructed in DuckDB.

There are also 4 parameters that can be used for the rejects tables

  1. store_rejects: boolean, defines if we will store rejects, by default it stores on tables named reject_scans and reject_errors
  2. rejects_table: VARCHAR, optional table where errors will be stored. By default it's reject_errors.
  3. rejects_scans: VARCHAR, optional table where scan information about the errors will be stored. By default it's reject_scans.
  4. rejects_limit: BIGINT, optional variable indicating how many errors should be stored

Example:

name,age,current_day, barks
oogie boogie,3, 2023-01-01, 2
oogie boogie,3, 2023-01-02, 5
oogie boogie,3, 2023-01-03, bla, 7
oogie boogie,3, bla, bla, 7
"oogie boogie"bla,3, 2023-01-04
oogie boogie,3, bla
oogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogie,3, bla
query IIII
FROM read_csv('data/csv/rejects/multiple_errors/multiple_errors.csv',
    columns = {'name': 'VARCHAR', 'age': 'INTEGER', 'current_day': 'DATE', 'barks': 'INTEGER'},
    store_rejects = true, auto_detect=false, header = 1, max_line_size=40);
----
oogie boogie	3	2023-01-01	2
oogie boogie	3	2023-01-02	5

query IIIIIIIIIIIII rowsort
FROM reject_scans ORDER BY ALL;
----
71	0	data/csv/rejects/multiple_errors/multiple_errors.csv	,	"	"	\n	0	true	{'name': 'VARCHAR','age': 'INTEGER','current_day': 'DATE','barks': 'INTEGER'}	NULL	NULL	max_line_size='40', header=true, store_rejects=true


query IIIIIIIIII rowsort
FROM reject_errors ORDER BY ALL;
----
71	0	4	89	116	4	barks	CAST	oogie boogie,3, 2023-01-03, bla, 7	Error when converting column "barks". Could not convert string " bla" to 'INTEGER'
71	0	4	89	120	5	NULL	TOO MANY COLUMNS	oogie boogie,3, 2023-01-03, bla, 7	Expected Number of Columns: 4 Found: 5
71	0	5	124	144	4	barks	CAST	oogie boogie,3, bla, bla, 7	Error when converting column "barks". Could not convert string " bla" to 'INTEGER'
71	0	5	124	148	5	NULL	TOO MANY COLUMNS	oogie boogie,3, bla, bla, 7	Expected Number of Columns: 4 Found: 5
71	0	6	152	152	1	name	UNQUOTED VALUE	"oogie boogie"bla,3, 2023-01-04	Value with unterminated quote found.
71	0	6	152	183	3	barks	MISSING COLUMNS	"oogie boogie"bla,3, 2023-01-04	Expected Number of Columns: 4 Found: 3
71	0	7	184	203	3	barks	MISSING COLUMNS	oogie boogie,3, bla	Expected Number of Columns: 4 Found: 3
71	0	8	204	204	NULL	NULL	LINE SIZE OVER MAXIMUM	oogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogie,3, bla	Maximum line size of 40 bytes exceeded. Actual Size:92 bytes.
71	0	8	204	295	3	barks	MISSING COLUMNS	oogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogieoogie boogie,3, bla	Expected Number of Columns: 4 Found: 3

Also, if multiple errors occur on the same line, they will all be reported. The exception to this is related to casting errors during the Flush method (that is, not implicit casts). This situation should improve as we implement more implicit casts.

@github-actions github-actions bot marked this pull request as draft April 5, 2024 09:17
@pdet pdet marked this pull request as ready for review April 5, 2024 09:25
@darthf1
Copy link

darthf1 commented Apr 5, 2024

This is awesome!

Would it be possible to add something like MISSING VALUES? Where a column which does exist, but with a NULL value, will error out?

I'm currently doing CSV validation via JSON schema, where I check datatypes, missing columns, and missing values.

Reading this PR description I can replace it almost fully, in my usecase at least, with the rejects table in duckdb which will be a lot more performant.

But I recognize that NULL values have nothing to do with a valid / invalid CSV structure :)

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks good. Some comments below:

"name": "rejects_table_name",
"type": "string"
"name": "store_rejects",
"type": "CSVOption<bool>"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't change the type of fields, unless the serialization of the types is identical

@pdet
Copy link
Contributor Author

pdet commented Apr 6, 2024

This is awesome!

Would it be possible to add something like MISSING VALUES? Where a column which does exist, but with a NULL value, will error out?

I'm currently doing CSV validation via JSON schema, where I check datatypes, missing columns, and missing values.

Reading this PR description I can replace it almost fully, in my usecase at least, with the rejects table in duckdb which will be a lot more performant.

But I recognize that NULL values have nothing to do with a valid / invalid CSV structure :)

Absolutely. I think that's a bit orthogonal to this PR, but I'm happy to add something in this direction in the near-future. :-)

@Mytherin Mytherin marked this pull request as draft April 9, 2024 11:04
@Mytherin Mytherin marked this pull request as ready for review April 9, 2024 11:04
@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 11, 2024 09:35
@pdet pdet marked this pull request as ready for review April 11, 2024 09:48
@Mytherin Mytherin merged commit 41419f3 into duckdb:main Apr 11, 2024
@Mytherin
Copy link
Collaborator

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Apr 11, 2024
Merge pull request duckdb/duckdb#11512 from pdet/rejects_tables_2.0
@adriens
Copy link

adriens commented Apr 11, 2024

👏 Awesome 🤩

@aborruso
Copy link

@pdet first of all thank you very much, it's really useful, it's really a great job.

I don't understand one of the errors in your example, SIZE OVER MAXIMUM: "Maximum line size of 40 bytes exceeded. Actual Size:92 bytes".

Is there a limit? And where is this limit set?

Thank you again

@aborruso
Copy link

Is there a limit? And where is this limit set?

I'm stupid :)

We have max_line_size=40

@aborruso
Copy link

aborruso commented Apr 13, 2024

Hi @pdet I have tested it and I have some notes.

My input file:

nome,compleanno,altezza
Maurizio,1992-12-27,187
Paola,gennaio,162
Andy,1973-07-06,176,Palermo

Chiara,1991-02-02,162

If I run

FROM read_csv('tmp.csv',columns = {'nome': 'VARCHAR', 'compleanno': 'DATE', 'altezza': 'INTEGER'}, store_rejects = true, auto_detect=false, header = 1);

I have the below errors. My notes:

  • In the error log I read a CAST error in line 5. It doesn't seem right to me, I have a cast error in line 3;
  • reading the file, it correctly skips line 5, but I cannot find the error in the log. How about introducing the empty_ row error?
scan_id file_id line line_byte_position byte_position column_idx column_name error_type csv_line error_message
7 0 4 67 86 4 TOO MANY COLUMNS Andy,1973-07-06,176,Palermo Expected Number of Columns: 3 Found: 4
7 0 5 49 2 compleanno CAST Paola,gennaio,162 Error when converting column "compleanno". date field value out of range: "gennaio", expected format is (YYYY-MM-DD)

@aborruso
Copy link

@pdet I'm adding another note.

If I have this

nome,compleanno,altezza
Maurizio,1992-12-27,187
Paola,gennaio,162
Andy,1973-07-06,176,Palermo

Chiara,02-02,162
Mario

I have a cast error in row 7: "Error when converting column ""compleanno"". date field value out of range: ""gennaio"", expected format is (YYYY-MM-DD)"
That cast error is in row 6.

@pdet pdet deleted the rejects_tables_2.0 branch June 25, 2024 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants