CSV Rejects Tables 2.0 #11512
This is awesome! Would it be possible to add something like MISSING VALUES, where a column that exists but holds a NULL value errors out? I'm currently doing CSV validation via JSON schema, where I check datatypes, missing columns, and missing values. Reading this PR description, I can replace it almost fully, in my use case at least, with the rejects table in duckdb, which will be a lot more performant. But I recognize that NULL values have nothing to do with a valid / invalid CSV structure :)
Thanks for the PR! Looks good. Some comments below:
"name": "rejects_table_name",
"type": "string"
"name": "store_rejects",
"type": "CSVOption<bool>"
We can't change the type of fields unless the serialization of the types is identical.
Absolutely. I think that's a bit orthogonal to this PR, but I'm happy to add something in this direction in the near future. :-)
Thanks!
Merge pull request duckdb/duckdb#11512 from pdet/rejects_tables_2.0
👏 Awesome 🤩
@pdet first of all, thank you very much. It's really useful and a great job. I don't understand one of the errors in your example. Is there a limit? And where is this limit set? Thank you again
I'm stupid :) We have |
Hi @pdet I have tested it and I have some notes. My input file:
If I run
I have the below errors. My notes:
@pdet I'm adding another note. If I have this
I have a cast error in row 7: "Error when converting column ""compleanno"". date field value out of range: ""gennaio"", expected format is (YYYY-MM-DD)"
This PR extends the current implementation of the CSV Rejects Tables.
The following errors can now be registered in a rejects table:
Scans that store rejects will also produce two temporary tables.
reject_scans - Stores information about the CSV Scan, including configuration.
Columns:
reject_errors - Stores errors that happened in a given scan.
Columns:
There are also four parameters that can be used for the rejects tables:
Example:
Also, if multiple errors occur on the same line, they will all be reported. The exception is casting errors raised during the Flush method (that is, casts that are not implicit). This situation should improve as we implement more implicit casts.
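To illustrate the "all errors on a line are reported" behaviour, here is a minimal standard-library Python sketch. It is not DuckDB's implementation and does not use DuckDB's API. The schema, the sample data, the function name `scan_with_rejects`, and the error labels `cast` and `column_count` are all assumptions made up for this example. It only shows the idea of recording one reject entry per error instead of stopping at the first failure on a line.

```python
import csv
import io
from datetime import datetime


def parse_date(value):
    # Strict YYYY-MM-DD parsing; raises ValueError for anything else.
    return datetime.strptime(value, "%Y-%m-%d").date()


# Hypothetical schema for illustration: column name -> converter.
SCHEMA = {"nome": str, "eta": int, "compleanno": parse_date}

# Hypothetical input: line 3 has two bad values, line 4 is missing a column.
SAMPLE = "nome,eta,compleanno\nanna,34,1990-01-05\nluca,abc,gennaio\nsara,29\n"


def scan_with_rejects(text, schema):
    """Return (good_rows, rejects).

    Each reject is a tuple (line, column, error_type, message). A line with
    several problems contributes several entries.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    good_rows, rejects = [], []
    for fields in reader:
        line_no = reader.line_num  # physical line number in the input
        errors = []
        if len(fields) != len(header):
            message = f"expected {len(header)} columns, found {len(fields)}"
            errors.append((line_no, None, "column_count", message))
        converted = {}
        # Keep converting after a failure so every bad column is recorded.
        for name, raw in zip(header, fields):
            try:
                converted[name] = schema[name](raw)
            except ValueError as exc:
                errors.append((line_no, name, "cast", str(exc)))
        if errors:
            rejects.extend(errors)
        else:
            good_rows.append(converted)
    return good_rows, rejects


good_rows, rejects = scan_with_rejects(SAMPLE, SCHEMA)
for reject in rejects:
    print(reject)
```

In this sketch a rejected line is dropped from the result entirely. Whether a real scan skips the line or aborts depends on other reader options that are outside the scope of this example.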