csv: parse escape character in unquoted fields #14464
Conversation
@pdet The PR is ready for review. Please let me know if anything needs to be changed. Thanks!
Hi @fanyang01 Thanks for the PR. Could you add a couple more tests?
loop buffer_size 10 25
query ${number of columns}
read_csv('data/csv/unquoted_escape.csv', quote = '', escape = '\', sep = '\t')
----
${expected_result}
endloop
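For comparison only (not DuckDB code): Python's standard csv module supports the same unquoted-escape convention, with escapechar and QUOTE_NONE playing the roles of escape = '\' and quote = ''. A minimal sketch on made-up data:

```python
import csv
import io

# Hypothetical two-row TSV: the second field of the data row contains
# a backslash-escaped tab inside an unquoted value.
raw = "name\tcode\nfoo\tprint(1)\\\tprint(2)\n"

reader = csv.reader(
    io.StringIO(raw),
    delimiter="\t",
    quoting=csv.QUOTE_NONE,  # no quote character, like quote = ''
    escapechar="\\",         # backslash makes the following tab literal
)
rows = list(reader)
print(rows)  # [['name', 'code'], ['foo', 'print(1)\tprint(2)']]
```

The escaped tab is kept inside the field rather than splitting it, which is the behavior this PR adds to DuckDB's reader.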
A bit orthogonal to this PR, but good to keep in mind: it might also be possible to extend the sniffer to detect escapes in unquoted values.
@pdet I'm following your suggestion to add more tests. The reader fails for more complex cases and I'm working on fixing it now. Thanks for the feedback.
No worries, many thanks for tackling this issue! Also, one other test that should be added covers CSV files with large (i.e., over 16 bytes) unquoted escaped values!
That is exactly what I'm experimenting with! I chose the HumanEval dataset, which contains 164 programming problems. Three fields of this dataset include multiline Python code with multiple occurrences of escaped characters. I will add the remaining tests soon.
Hi @pdet, I have added all the tests you suggested and successfully run them on my local machine. |
Update: The solution that minimizes the number of broken existing tests appears to be allowing the state machine to enter an invalid state when the escape character is followed by a non-special character. Otherwise, the sniffer may select undesired dialect candidates.
I apologize for previously selecting the main branch as the base. I have now retargeted this PR to the feature branch. |
Oops, it appears I should target the main branch because the new option needs to come after the |
Hi @fanyang01, I think it all looks excellent! Thanks again for all the effort you put into this PR and for going over my comments and nits :-) It looks good to me. @Mytherin, maybe you would like to have a look?
@fanyang01, @pdet, @Mytherin - I hope it's not too late to comment. The choice of name seems confusing.
@pkoppstein I agree that the name could be confusing. My initial thought is to introduce a dedicated option for this. It might be better to discuss this further in a GitHub Discussion.
@fanyang01 - Unfortunately, at this point I don't understand the new rfc_4180 flag. I think part of the problem here is that RFC 4180 and similar specifications define quoted fields as:

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

That is, as far as RFC 4180 is concerned, there are two kinds of fields: escaped (quoted) and non-escaped. I suspect that DuckDB's approach to supporting multiple variations of CSV is part of what makes a single flag hard to interpret. As for "dialects" of CSV - yes, I think a flag that selects a well-defined dialect would be useful.
The question in my mind is whether or not dialect=rfc_4180 would be sufficient on its own. An alternative to adding such an option would be to keep the existing per-feature flags. Anyway, to summarize, my hope is that DuckDB will support well-defined CSV dialects.
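For reference, the doubled-quote escaping that the RFC 4180 grammar above describes (2DQUOTE inside a quoted field) is the convention Python's default csv dialect implements; a quick illustration:

```python
import csv
import io

# RFC 4180 escapes a double quote inside a quoted field by doubling it;
# Python's default csv dialect follows the same rule.
line = 'a,"say ""hi""",c\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['a', 'say "hi"', 'c']
```

This is the "escaped" field kind from the ABNF; backslash-style escaping in unquoted fields, as added by this PR, is a different convention outside RFC 4180.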
csv: parse escape character in unquoted fields (duckdb/duckdb#14464) Add fallback for thread count if jemalloc cannot identify (duckdb/duckdb#14688) [Python] Add `LambdaExpression` to the Python Expression API (duckdb/duckdb#14713) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Resolves #14440
This PR enables DuckDB to parse the following TSV correctly:
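The TSV sample itself was not preserved in this extract. A hypothetical file with the relevant feature - a backslash-escaped tab inside an unquoted field - could look like this (separators are real tab characters):

```
id	code
1	print(1)\	print(2)
```

With quote = '', escape = '\', sep = '\t', the second row would parse into two fields, the escaped tab becoming part of the code value rather than starting a third column.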