
Conversation

fanyang01
Contributor

@fanyang01 fanyang01 commented Oct 21, 2024

Resolves #14440

This PR enables DuckDB to correctly parse TSV files like the following (the contents of t.txt, where a backslash escapes the tab, newline, or backslash that follows it):

0	\\	0
1	\		1
2	\
	2
3	a\	a	3
4	b\
b	4
5	c\\c	5

Reading it with the new rfc_4180 = false option:

D FROM read_csv('t.txt', quote = '', escape = '\', sep = '\t', rfc_4180 = false) t (i, s, j);
┌───────┬─────────┬───────┐
│   i   │    s    │   j   │
│ int64 │ varchar │ int64 │
├───────┼─────────┼───────┤
│     0 │ \       │     0 │
│     1 │ \t      │     1 │
│     2 │ \n      │     2 │
│     3 │ a\ta    │     3 │
│     4 │ b\nb    │     4 │
│     5 │ c\c     │     5 │
└───────┴─────────┴───────┘

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 21, 2024 10:28
@fanyang01 fanyang01 marked this pull request as ready for review October 21, 2024 10:33
@fanyang01
Contributor Author

@pdet The PR is ready for review. Please let me know if anything needs to be changed. Thanks!

@pdet
Contributor

pdet commented Oct 21, 2024

Hi @fanyang01 Thanks for the PR.

Could you add a couple more tests?

  1. A test with escaped values both inside and outside quoted values in the same file.
  2. A test where the quote and the escape character are the same (e.g., quote = '"' and escape = '"').
  3. A test where escaped unquoted values straddle CSV buffer boundaries (you might want a slightly bigger file for this test), e.g. along the lines of the template below (a concrete instantiation is sketched after this list):

loop buffer_size 10 25

query ${number of columns}
read_csv('data/csv/unquoted_escape.csv', quote = '', escape = '\', sep = '\t')
----
${expected_result}

endloop

  4. A test with a file of at least 30k rows.
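
For concreteness, item 3 could be instantiated roughly like this (a sketch based on the template above; the file name, the three-column layout, the buffer-size range, and the expected rows are assumptions, not the final test):

loop buffer_size 10 25

query ITI
SELECT * FROM read_csv('data/csv/unquoted_escape.csv', quote = '', escape = '\', sep = '\t', buffer_size = ${buffer_size}) t(i, s, j)
----
${expected_result}

endloop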

A bit orthogonal to this PR, but good to keep in mind: it might also be possible to extend the sniffer to detect escapes in unquoted values.

@fanyang01
Contributor Author

@pdet I'm following your suggestion to add more tests. The reader fails for more complex cases and I'm working on fixing it now. Thanks for the feedback.

@pdet
Contributor

pdet commented Oct 22, 2024

@pdet I'm following your suggestion to add more tests. The reader fails for more complex cases and I'm working on fixing it now. Thanks for the feedback.

No worries, many thanks for tackling this issue!

Also, one other test that should be added is with CSV Files with large (i.e., over 16 bytes) unquoted escaped values!

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 22, 2024 09:31
@fanyang01
Contributor Author

Also, one other test that should be added is with CSV Files with large (i.e., over 16 bytes) unquoted escaped values!

That is exactly what I'm experimenting with! I chose the HumanEval dataset, which contains 164 programming problems. Three fields of this dataset contain multiline Python code with many occurrences of \n and , characters, and I also replaced the indentation with tab characters. I have addressed the failure, and DuckDB can now read the unquoted version of these fields. Is this an acceptable test?

I will add the remaining tests soon.
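
For illustration, reading such a file would look roughly like this (the path is a placeholder, not necessarily the file name used in the test suite):

-- Illustrative only: the path is a placeholder for the converted HumanEval TSV.
SELECT count(*)
FROM read_csv('data/csv/humaneval_unquoted.tsv',
              quote = '', escape = '\', sep = '\t', rfc_4180 = false);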

@fanyang01 fanyang01 marked this pull request as ready for review October 23, 2024 07:36
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 23, 2024 07:38
@fanyang01 fanyang01 marked this pull request as ready for review October 23, 2024 07:42
@fanyang01
Contributor Author

Hi @pdet,

I have added all the tests you suggested and run them successfully on my local machine.

@fanyang01
Contributor Author

fanyang01 commented Oct 23, 2024

The read_csv(..., quote = '') test failed for 'data/csv/quoted_values_delimited.csv':

"header1;header2;header3"
"value1;value2;value3"
"value1;value2;value3"
"value1;value2;value3"

It appears the issue is that the escape character defaults to '"', which confuses the parser when the quote is set to ''. Changing the escape character to something else (e.g., '') would resolve this, but I am uncertain whether we should proceed with that change.
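
A minimal sketch of that workaround, just to make the idea concrete (an assumption about how the test could be adjusted, not necessarily the change adopted in this PR):

-- Clearing the escape character as well avoids the default '"' clashing with quote = ''.
FROM read_csv('data/csv/quoted_values_delimited.csv', quote = '', escape = '');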

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 23, 2024 09:11
@fanyang01 fanyang01 marked this pull request as ready for review October 24, 2024 03:17
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 24, 2024 04:01
@fanyang01 fanyang01 marked this pull request as ready for review October 24, 2024 09:28
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 24, 2024 09:35
@fanyang01
Contributor Author

fanyang01 commented Oct 24, 2024

Update:

The solution that minimizes the number of broken existing tests appears to be letting the state machine enter an invalid state when the escape character is followed by a non-special character. Otherwise, the sniffer may select undesired dialect candidates. For instance, it might choose '"' as the escape character when the quote character is not '"', and if each line ends with '"' the entire file would then be parsed as a single row (since the sniffer prefers a larger column count).

Additionally, for \r\n newlines, we must escape both characters of the newline as a whole (otherwise the tests fail on Windows). To handle this, I have introduced a new state, ESCAPED_RETURN, to represent the situation where the escape character is followed by a \r.
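
To make the sniffer concern above concrete, here is a hypothetical probe (the file is invented for illustration; this is not output from the actual sniffer or test suite):

-- Hypothetical: if escape = '"' were admitted as a candidate while the quote is not '"',
-- the trailing '"' on each line could escape the newline, so the whole file would parse
-- as one very wide row, which the sniffer would then prefer.
FROM sniff_csv('data/csv/trailing_quote.csv');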

@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 2, 2024 03:47
@fanyang01 fanyang01 marked this pull request as ready for review November 2, 2024 03:52
@fanyang01 fanyang01 changed the base branch from main to feature November 5, 2024 03:04
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 5, 2024 03:15
@fanyang01 fanyang01 marked this pull request as ready for review November 5, 2024 05:35
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 5, 2024 05:40
@fanyang01
Contributor Author

I apologize for previously selecting the main branch as the base. I have now retargeted this PR to the feature branch.

@fanyang01 fanyang01 marked this pull request as ready for review November 5, 2024 05:43
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 6, 2024 06:58
@fanyang01 fanyang01 changed the base branch from feature to main November 6, 2024 06:59
@fanyang01
Contributor Author

Oops, it appears I should target the main branch because the new option needs to come after the encoding option introduced in #14560.

@fanyang01 fanyang01 marked this pull request as ready for review November 6, 2024 07:03
@pdet
Contributor

pdet commented Nov 6, 2024

Hi @fanyang01 I think it all looks excellent! Indeed, we merged feature into main yesterday, so it should now be targeting main!

Thanks again for all the effort you put in this PR and for going over my comments and nits :-)

It looks good to me.

@Mytherin maybe you would like to have a look?

@Mytherin Mytherin merged commit 0ccf3c2 into duckdb:main Nov 6, 2024
42 checks passed
@fanyang01 fanyang01 deleted the csv-escape branch November 6, 2024 14:35
@pkoppstein

@fanyang01, @pdet, @Mytherin - I hope it's not too late to comment
on the choice of name for the new flag, namely rfc_4180.

The choice of name seems confusing, because
RFC 4180 bundles together several requirements, and does
so in a rather unsatisfactory way. Using this particular name
drags in a whole bunch of dirty laundry that can only add to
confusion and/or complexity.

@fanyang01
Contributor Author

@pkoppstein I agree that rfc_4180 = false appears somewhat confusing. What would you suggest instead?

My initial thought is to introduce a dialect option with the default value set to rfc_4180. We could then enable the feature in this PR by setting dialect = 'pg_text_format' or dialect = 'mysql'.

It might be better to discuss this further in a GitHub Discussion.
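
For illustration, such a call could look roughly like this (hypothetical syntax; no dialect option exists in DuckDB today):

-- Hypothetical 'dialect' option; the name and its values are only a suggestion.
FROM read_csv('t.txt', sep = '\t', dialect = 'pg_text_format');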

@pkoppstein

@fanyang01 - Unfortunately, at this point I don't understand the new rfc_4180 flag
well enough to be able to offer a helpful alternative name. In fact,
your description ("remove the escape character in unquoted fields")
leaves me wondering whether the effect you're trying to achieve might
not be achieved in some other way with fewer complications.

I think part of the problem here is that in RFC 4180 and similar
interpretations of CSV, the doubling of DQUOTE was not intended
to define a general "escape character" mechanism. The relevant line from the RFC is:

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

That is, as far as RFC 4180 is concerned, there are two kinds of
fields: "escaped fields" and "ordinary fields". There is no mention
of an "escape character", so as soon as we introduce that as a
separate idea, complications start to pile up.

I suspect that DuckDB's approach to supporting multiple variations of
multiple CSV dialects would in the end have been better served by having
separate flags for an "escape character" (in the usual sense of a
single character like '\'), and for the policy of doubling the "quote"
character (whatever it might be) in an "escaped field".
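
A rough sketch of that separation, just to make the suggestion concrete (hypothetical option names, not DuckDB's actual API):

-- Hypothetical flags: 'escape' as a single escape character, plus a separate
-- 'doubled_quote' switch for RFC 4180-style doubling of the quote character.
FROM read_csv('t.csv', quote = '"', doubled_quote = true, escape = '\');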


As for "dialects" of CSV - yes, I think that a flag that
would clearly specify a certain behavior and how it is affected
by all the feature-specific flags would be an excellent way to proceed:

  • using a flag name such as "dialect" (or "mode" or "defaults")
    would make it clear that only one dialect at a time can be specified;
  • the documentation would clarify exactly which flag values it implies;
  • the semantics would also be clear enough (i.e. explicit per-feature
    flags would always over-ride the "dialect" setting).

The question in my mind is whether or not dialect=rfc_4180 would
be equivalent to a constellation of feature-specific flags.
At first I was inclined to think that such "dialect" flags
should be no more than such a constellation, but I'm worried
that there are too many complications. As an exercise, though, it would
probably pay off to see what additional flags would be needed, and
how they'd interact with each other.

An alternative to adding dialects that you've probably already considered would be
to offer more than just read_csv() and read_csv_auto() functions.
I've often wished it were easier to read and write TSV files in accordance with
a particular set of requirements.

Anyway, to summarize, my hope is that DuckDB will support
a small handful of dialects by providing a clear exposition of exactly what
each one means, especially in relation to the other feature-specific flags.

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 21, 2024
csv: parse escape character in unquoted fields (duckdb/duckdb#14464)
Add fallback for thread count if jemalloc cannot identify (duckdb/duckdb#14688)
[Python] Add `LambdaExpression` to the Python Expression API (duckdb/duckdb#14713)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 21, 2024
csv: parse escape character in unquoted fields (duckdb/duckdb#14464)
Add fallback for thread count if jemalloc cannot identify (duckdb/duckdb#14688)
[Python] Add `LambdaExpression` to the Python Expression API (duckdb/duckdb#14713)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>