New Algorithm to find a new line on parallel execution #14260

pdet · 2024-10-07T15:04:34Z

Our previous algorithm used a best-effort technique to find the next valid line in a CSV file. It could accidentally skip invalid lines, and it was slow on large blocks of invalid data.

The new algorithm runs a limited number of tests to identify a new line. It begins by detecting the first newline delimiter in the buffer, then performs up to three checks using our CSV State Machine to determine where the new line starts:

1. We assume the parser is in the STANDARD state and read until the next newline. If the position of that newline matches the first newline in the buffer and the row is valid, we are done. Otherwise, we proceed to step 2.
2. We repeat step 1, but start from the QUOTED state. If we find a valid row and the next newline position matches the first newline in the buffer, we are done. Otherwise, we move to step 3.
3. We follow the same approach as in the previous steps, but start from the ESCAPED state.

If multiple valid rows are found from different starting states, we return the row at the lowest buffer position.
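
The steps above can be sketched roughly as follows. This is a toy illustration, not the code in this PR: the function names (`step`, `next_newline`, `row_is_valid`, `find_row_start`) are hypothetical, the dialect (comma delimiter, `"` quote, backslash escape, `\n` newline) is an assumption, and "the row is valid" is interpreted here as "the row after the newline has the expected number of columns".

```python
from enum import Enum, auto
from typing import Optional

# Assumed dialect for this sketch only.
DELIM, QUOTE, ESCAPE, NEWLINE = ",", '"', "\\", "\n"


class State(Enum):
    STANDARD = auto()
    QUOTED = auto()
    ESCAPED = auto()


def step(state: State, ch: str) -> State:
    """Toy transition function; a real CSV state machine has more states."""
    if state is State.STANDARD:
        return State.QUOTED if ch == QUOTE else State.STANDARD
    if state is State.QUOTED:
        if ch == ESCAPE:
            return State.ESCAPED
        return State.STANDARD if ch == QUOTE else State.QUOTED
    return State.QUOTED  # ESCAPED: the escaped character is consumed


def next_newline(buf: str, state: State) -> Optional[int]:
    """Index of the first newline seen outside quotes, starting in `state`."""
    for i, ch in enumerate(buf):
        if state is State.STANDARD and ch == NEWLINE:
            return i
        state = step(state, ch)
    return None


def row_is_valid(buf: str, start: int, num_columns: int) -> bool:
    """Assumed validity rule: the row at `start` has `num_columns` columns."""
    if start >= len(buf):
        return False
    state, columns = State.STANDARD, 1
    for ch in buf[start:]:
        if state is State.STANDARD:
            if ch == NEWLINE:
                break
            if ch == DELIM:
                columns += 1
        state = step(state, ch)
    return state is State.STANDARD and columns == num_columns


def find_row_start(buf: str, num_columns: int) -> Optional[int]:
    first_nl = buf.find(NEWLINE)
    if first_nl == -1:
        return None  # the approach assumes at least one newline in the buffer
    candidates = []
    for start_state in (State.STANDARD, State.QUOTED, State.ESCAPED):
        nl = next_newline(buf, start_state)
        if nl is None or not row_is_valid(buf, nl + 1, num_columns):
            continue
        if nl == first_nl:
            return nl + 1  # early termination: no lower position is possible
        candidates.append(nl + 1)
    return min(candidates) if candidates else None
```

With a buffer that begins inside a quoted field, such as `'llo\nworld"\n2,"x"\n3,"y"\n'` and two expected columns, the STANDARD attempt fails validation and the QUOTED attempt finds the row starting at offset 11.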

To enable proper early termination, we removed support for mixed newline delimiters (i.e., files with a mix of \r\n, \r, and \n delimiters).
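
To make "mixed newline delimiters" concrete, here is a minimal sketch that reports which newline styles occur in a buffer. The helper name `newline_styles` is hypothetical and is not part of this PR; for brevity it also ignores quoting, so a newline inside a quoted value is counted too.

```python
def newline_styles(buf: str) -> set:
    """Return the set of newline delimiters ("\r\n", "\r", "\n") seen in buf."""
    styles = set()
    i = 0
    while i < len(buf):
        if buf[i] == "\r":
            if i + 1 < len(buf) and buf[i + 1] == "\n":
                styles.add("\r\n")
                i += 2  # consume both characters of the CRLF pair
                continue
            styles.add("\r")
        elif buf[i] == "\n":
            styles.add("\n")
        i += 1
    return styles


def has_mixed_newlines(buf: str) -> bool:
    # More than one style in the same buffer counts as "mixed" in this sketch.
    return len(newline_styles(buf)) > 1
```

Under this definition, a file that uses only `\r\n` is fine, while one that switches between `\n` and `\r\n` would be treated as mixed.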

One thing to note is that this approach relies on the assumption that there is at least one newline in the buffer. I’m currently working to make this assumption more explicit by using the maximum_line_size option and making our buffer sizes dependent on it (if maximum_line_size > default_buffer_size).
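
As a rough sketch of that idea (the work is described as still in progress, so this is an assumption about its shape, and `effective_buffer_size` is a hypothetical name): the buffer grows to the maximum line size only when that exceeds the default.

```python
def effective_buffer_size(default_buffer_size: int, maximum_line_size: int) -> int:
    """Pick a buffer size large enough to hold one maximum-length line.

    Assumption: a buffer at least as large as the longest allowed line should
    contain at least one newline, unless it is the last buffer of the file.
    """
    if maximum_line_size > default_buffer_size:
        return maximum_line_size
    return default_buffer_size
```

The sizes passed in are whatever the reader is configured with; no specific defaults are implied here.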

Mytherin · 2024-10-08T12:03:09Z

Thanks!

pdet added 14 commits October 1, 2024 15:07

New strategy to figure out newlines

3fb0378

new shiny algorithm

5d6aafd

Merge branch 'bug_14177' into find_new_line

59b3635

Add check if we are at the end of the file

981f603

Small adjustments

5705a39

Not support crazy mix of CRNL and NL

18f0898

Always get the best of the best

7989e03

Avoid doing the quoted and escape states

4726621

Adjust test

9b782b0

wip

dae120e

Introducing new state special to new line finder

89c989f

Merge branch 'main' into find_new_line

7817329

Fix test

dff4689

generate files

64683ac

Mytherin changed the base branch from main to feature October 8, 2024 11:59

Mytherin merged commit 336509e into duckdb:feature Oct 8, 2024
39 of 40 checks passed

pdet deleted the find_new_line branch November 27, 2024 12:33
