
Conversation

fanyang01
Contributor

@fanyang01 fanyang01 commented Oct 21, 2024

Resolves #14440

This PR enables DuckDB to correctly parse TSV files like the following (the contents of t.txt, where a backslash escapes the tab, newline, or backslash that follows it):

0	\\	0
1	\		1
2	\
	2
3	a\	a	3
4	b\
b	4
5	c\\c	5

Reading it with the new rfc_4180 = false option:

D FROM read_csv('t.txt', quote = '', escape = '\', sep = '\t', rfc_4180 = false) t (i, s, j);
┌───────┬─────────┬───────┐
│   i   │    s    │   j   │
│ int64 │ varchar │ int64 │
├───────┼─────────┼───────┤
│     0 │ \       │     0 │
│     1 │ \t      │     1 │
│     2 │ \n      │     2 │
│     3 │ a\ta    │     3 │
│     4 │ b\nb    │     4 │
│     5 │ c\c     │     5 │
└───────┴─────────┴───────┘

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 21, 2024 10:28
@fanyang01 fanyang01 marked this pull request as ready for review October 21, 2024 10:33
@fanyang01
Contributor Author

@pdet The PR is ready for review. Please let me know if anything needs to be changed. Thanks!

@pdet
Contributor

pdet commented Oct 21, 2024

Hi @fanyang01 Thanks for the PR.

Could you add a couple more tests?

  1. A test with escaped values both inside and outside quoted values in the same file.
  2. A test where the quote and the escape character are the same (e.g., quote = '"' and escape = '"').
  3. A test where escaped unquoted values straddle CSV buffer boundaries (you might want a slightly bigger file for this test), e.g. along the lines of the template below (a concrete instantiation is sketched after this list):

loop buffer_size 10 25

query ${number of columns}
read_csv('data/csv/unquoted_escape.csv', quote = '', escape = '\', sep = '\t')
----
${expected_result}

endloop

  4. A test with a file of at least 30k rows.
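
For concreteness, item 3 could be instantiated roughly like this (a sketch based on the template above; the file name, the three-column layout, the buffer-size range, and the expected rows are assumptions, not the final test):

loop buffer_size 10 25

query ITI
SELECT * FROM read_csv('data/csv/unquoted_escape.csv', quote = '', escape = '\', sep = '\t', buffer_size = ${buffer_size}) t(i, s, j)
----
${expected_result}

endloop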

A bit orthogonal to this PR, but good to keep in mind: it might also be possible to extend the sniffer to detect escapes in unquoted values.

@fanyang01
Contributor Author

@pdet I'm following your suggestion to add more tests. The reader fails for more complex cases and I'm working on fixing it now. Thanks for the feedback.

@pdet
Contributor

pdet commented Oct 22, 2024

@pdet I'm following your suggestion to add more tests. The reader fails for more complex cases and I'm working on fixing it now. Thanks for the feedback.

No worries, many thanks for tackling this issue!

Also, one other test that should be added is with CSV Files with large (i.e., over 16 bytes) unquoted escaped values!

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 22, 2024 09:31
@fanyang01
Contributor Author

Also, one other test that should be added is with CSV Files with large (i.e., over 16 bytes) unquoted escaped values!

That is exactly what I'm experimenting with! I chose the HumanEval dataset, which contains 164 programming problems. Three fields of this dataset contain multiline Python code with many occurrences of \n and , characters, and I also replaced the indentation with tab characters. I have addressed the failure, and DuckDB can now read the unquoted version of these fields. Is this an acceptable test?

I will add the remaining tests soon.
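
For illustration, reading such a file would look roughly like this (the path is a placeholder, not necessarily the file name used in the test suite):

-- Illustrative only: the path is a placeholder for the converted HumanEval TSV.
SELECT count(*)
FROM read_csv('data/csv/humaneval_unquoted.tsv',
              quote = '', escape = '\', sep = '\t', rfc_4180 = false);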

@fanyang01 fanyang01 marked this pull request as ready for review October 23, 2024 07:36
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 23, 2024 07:38
@fanyang01 fanyang01 marked this pull request as ready for review October 23, 2024 07:42
@fanyang01
Contributor Author

Hi @pdet,

I have added all the tests you suggested and run them successfully on my local machine.

@fanyang01
Contributor Author

fanyang01 commented Oct 23, 2024

The read_csv(..., quote = '') test failed for 'data/csv/quoted_values_delimited.csv':

"header1;header2;header3"
"value1;value2;value3"
"value1;value2;value3"
"value1;value2;value3"

It appears the issue is that the escape character defaults to '"', which confuses the parser when the quote is set to ''. Changing the escape character to something else (e.g., '') would resolve this, but I am uncertain whether we should proceed with that change.
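
A minimal sketch of that workaround, just to make the idea concrete (an assumption about how the test could be adjusted, not necessarily the change adopted in this PR):

-- Clearing the escape character as well avoids the default '"' clashing with quote = ''.
FROM read_csv('data/csv/quoted_values_delimited.csv', quote = '', escape = '');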

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 23, 2024 09:11
@fanyang01 fanyang01 marked this pull request as ready for review October 24, 2024 03:17
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 24, 2024 04:01
@fanyang01 fanyang01 marked this pull request as ready for review October 24, 2024 09:28
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 24, 2024 09:35
@fanyang01
Contributor Author

fanyang01 commented Oct 24, 2024

Update:

The solution that minimizes the number of broken existing tests appears to be letting the state machine enter an invalid state when the escape character is followed by a non-special character. Otherwise, the sniffer may select undesired dialect candidates. For instance, it might choose '"' as the escape character when the quote character is not '"', and if each line ends with '"' the entire file would then be parsed as a single row (since the sniffer prefers a larger column count).

Additionally, for \r\n newlines, we must escape both characters of the newline as a whole (otherwise the tests fail on Windows). To handle this, I have introduced a new state, ESCAPED_RETURN, to represent the situation where the escape character is followed by a \r.
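
To make the sniffer concern above concrete, here is a hypothetical probe (the file is invented for illustration; this is not output from the actual sniffer or test suite):

-- Hypothetical: if escape = '"' were admitted as a candidate while the quote is not '"',
-- the trailing '"' on each line could escape the newline, so the whole file would parse
-- as one very wide row, which the sniffer would then prefer.
FROM sniff_csv('data/csv/trailing_quote.csv');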

@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 2, 2024 03:47
@fanyang01 fanyang01 marked this pull request as ready for review November 2, 2024 03:52
@fanyang01 fanyang01 changed the base branch from main to feature November 5, 2024 03:04
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 5, 2024 03:15
@fanyang01 fanyang01 marked this pull request as ready for review November 5, 2024 05:35
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 5, 2024 05:40
@fanyang01
Contributor Author

I apologize for previously selecting the main branch as the base. I have now retargeted this PR to the feature branch.

@fanyang01 fanyang01 marked this pull request as ready for review November 5, 2024 05:43
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 6, 2024 06:58
@fanyang01 fanyang01 changed the base branch from feature to main November 6, 2024 06:59
@fanyang01
Contributor Author

Oops, it appears I should target the main branch because the new option needs to come after the encoding option introduced in #14560.

@fanyang01 fanyang01 marked this pull request as ready for review November 6, 2024 07:03
@pdet
Contributor

pdet commented Nov 6, 2024

Hi @fanyang01 I think it all looks excellent! Indeed, we merged feature into main yesterday, so it should now be targeting main!

Thanks again for all the effort you put in this PR and for going over my comments and nits :-)

It looks good to me.

@Mytherin maybe you would like to have a look?

@Mytherin Mytherin merged commit 0ccf3c2 into duckdb:main Nov 6, 2024
42 checks passed
@fanyang01 fanyang01 deleted the csv-escape branch November 6, 2024 14:35
@pkoppstein

@fanyang01, @pdet, @Mytherin - I hope it's not too late to comment
on the choice of name for the new flag, namely rfc_4180.

The choice of name seems confusing, because
RFC 4180 bundles together several requirements, and does
so in a rather unsatisfactory way. Using this particular name
drags in a whole bunch of dirty laundry that can only add to
confusion and/or complexity.

@fanyang01
Contributor Author

@pkoppstein I agree that rfc_4180 = false appears somewhat confusing. What would you suggest instead?

My initial thought is to introduce a dialect option with the default value set to rfc_4180. We could then enable the feature in this PR by setting dialect = 'pg_text_format' or dialect = 'mysql'.

It might be better to discuss this further in a GitHub Discussion.
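
For illustration, such a call could look roughly like this (hypothetical syntax; no dialect option exists in DuckDB today):

-- Hypothetical 'dialect' option; the name and its values are only a suggestion.
FROM read_csv('t.txt', sep = '\t', dialect = 'pg_text_format');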

@pkoppstein

@fanyang01 - Unfortunately, at this point I don't understand the new rfc_4180 flag
well enough to be able to offer a helpful alternative name. In fact,
your description ("remove the escape character in unquoted fields")
leaves me wondering whether the effect you're trying to achieve might
not be achieved in some other way with fewer complications.

I think part of the problem here is that in RFC 4180 and similar
interpretations of CSV, the doubling of DQUOTE was not intended
to define a general "escape character" mechanism. The relevant line from the RFC is:

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

That is, as far as RFC 4180 is concerned, there are two kinds of
fields: "escaped fields" and "ordinary fields". There is no mention
of an "escape character", so as soon as we introduce that as a
separate idea, complications start to pile up.

I suspect that DuckDB's approach to supporting multiple variations of
multiple CSV dialects would in the end have been better served by having
separate flags for an "escape character" (in the usual sense of a
single character like '\'), and for the policy of doubling the "quote"
character (whatever it might be) in an "escaped field".
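
A rough sketch of that separation, just to make the suggestion concrete (hypothetical option names, not DuckDB's actual API):

-- Hypothetical flags: 'escape' as a single escape character, plus a separate
-- 'doubled_quote' switch for RFC 4180-style doubling of the quote character.
FROM read_csv('t.csv', quote = '"', doubled_quote = true, escape = '\');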


As for "dialects" of CSV - yes, I think that a flag that
would clearly specify a certain behavior and how it is affected
by all the feature-specific flags would be an excellent way to proceed:

  • using a flag name such as "dialect" (or "mode" or "defaults")
    would make it clear that only one dialect at a time can be specified;
  • the documentation would clarify exactly which flag values it implies;
  • the semantics would also be clear enough (i.e. explicit per-feature
    flags would always over-ride the "dialect" setting).

The question in my mind is whether or not dialect=rfc_4180 would
be equivalent to a constellation of feature-specific flags.
At first I was inclined to think that such "dialect" flags
should be no more than such a constellation, but I'm worried
that there are too many complications. As an exercise, though, it would
probably pay off to see what additional flags would be needed, and
how they'd interact with each other.

An alternative to adding dialects that you've probably already considered would be
to offer more than just read_csv() and read_csv_auto() functions.
I've often wished it were easier to read and write TSV files in accordance with
a particular set of requirements.

Anyway, to summarize, my hope is that DuckDB will support
a small handful of dialects by providing a clear exposition of exactly what
each one means, especially in relation to the other feature-specific flags.

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 21, 2024
csv: parse escape character in unquoted fields (duckdb/duckdb#14464)
Add fallback for thread count if jemalloc cannot identify (duckdb/duckdb#14688)
[Python] Add `LambdaExpression` to the Python Expression API (duckdb/duckdb#14713)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 21, 2024
csv: parse escape character in unquoted fields (duckdb/duckdb#14464)
Add fallback for thread count if jemalloc cannot identify (duckdb/duckdb#14688)
[Python] Add `LambdaExpression` to the Python Expression API (duckdb/duckdb#14713)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>