New Algorithm to find a new line on parallel execution #14260
Merged
+196
−134
Our previous algorithm used a best-effort technique to find a new valid line in a CSV file, which could sometimes cause it to accidentally skip invalid lines and perform slowly when handling large blocks of invalid data.
The new algorithm runs a limited number of tests to identify a new line. It begins by detecting the first newline delimiters in the buffer, then performs up to three checks using our CSV State Machine to determine the new line:
1. We assume the parser is in a `STANDARD` state and read until the next newline. If the position of the next newline matches the first newline in the buffer and the row is valid, we are done. Otherwise, we proceed to step 2.
2. We repeat step 1, but starting from a `QUOTED` state. If we find a valid row and the next newline position matches the first newline in the buffer, we are done. Otherwise, we move to step 3.
3. We follow the same approach as in the previous steps, but begin in the `ESCAPED` state.

If multiple valid rows are found from different starting states, we return the row at the lowest buffer position.
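The three checks can be illustrated with a small, self-contained sketch. This is not the code from this PR: it uses a toy three-state machine, and the names `State`, `scan_row`, `find_new_line`, and `expected_columns` are hypothetical. It also assumes `,` as the delimiter, `"` as the quote, `\` as the escape, `\n` as the only newline, and that a row counts as "valid" when the row following the candidate newline has the expected number of columns.

```python
from enum import Enum, auto


class State(Enum):
    STANDARD = auto()  # outside any quoted value
    QUOTED = auto()    # inside a quoted value
    ESCAPED = auto()   # right after an escape character inside quotes


# Assumed dialect for this sketch only.
DELIM, QUOTE, ESCAPE, NEWLINE = ",", '"', "\\", "\n"


def step(state, ch):
    """Toy transition function; a real CSV state machine has more states."""
    if state is State.STANDARD:
        return State.QUOTED if ch == QUOTE else State.STANDARD
    if state is State.QUOTED:
        if ch == ESCAPE:
            return State.ESCAPED
        return State.STANDARD if ch == QUOTE else State.QUOTED
    return State.QUOTED  # ESCAPED: the character is taken literally


def scan_row(buf, pos, state):
    """Return (index of the newline ending the row, column count), or None."""
    columns = 1
    for i in range(pos, len(buf)):
        ch = buf[i]
        if state is State.STANDARD:
            if ch == NEWLINE:
                return i, columns
            if ch == DELIM:
                columns += 1
        state = step(state, ch)
    return None  # buffer ended before the row did


def find_new_line(buf, expected_columns):
    """Return the buffer position where a new row starts, or None."""
    first_newline = buf.find(NEWLINE)
    if first_newline < 0:
        raise ValueError("buffer must contain at least one newline")
    candidates = []
    for start in (State.STANDARD, State.QUOTED, State.ESCAPED):
        end = scan_row(buf, 0, start)
        if end is None:
            continue
        row_start = end[0] + 1
        following = scan_row(buf, row_start, State.STANDARD)
        if following is None or following[1] != expected_columns:
            continue  # the row after this newline is not valid
        if end[0] == first_newline:
            return row_start  # early termination
        candidates.append(row_start)
    # No hypothesis matched the first newline: take the lowest position.
    return min(candidates) if candidates else None
```

For a buffer that begins in the middle of a quoted value, such as `'ab\ncd",x\n1,2\n3,4\n'` with two expected columns, the `STANDARD` hypothesis fails validation, the `QUOTED` and `ESCAPED` hypotheses both end the partial row at the second newline, and the sketch returns position 9, where the row `1,2` begins.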
To enable proper early termination, we removed support for mixed newline delimiters (i.e., files with a mix of `\r\n`, `\r`, and `\n` delimiters).

One thing to note is that this approach relies on the assumption that there is at least one newline in the buffer. I'm currently working to make this assumption more explicit by using the `maximum_line_size` option and making our buffer sizes dependent on it (if `maximum_line_size > default_buffer_size`).