@pdet pdet commented Oct 7, 2024

Our previous algorithm used a best-effort technique to find the next valid line in a CSV file. As a result, it could accidentally skip invalid lines, and it performed slowly on large blocks of invalid data.

The new algorithm runs a limited number of tests to identify a new line. It begins by detecting the first newline delimiter in the buffer, then performs up to three checks using our CSV State Machine to determine the new line:

1. We assume the parser is in a STANDARD state and read until the next newline. If the position of that newline matches the first newline in the buffer and the row is valid, we are done. Otherwise, we proceed to step 2.
2. We repeat step 1, but starting from a QUOTED state. If we find a valid row and the next newline position matches the first newline in the buffer, we are done. Otherwise, we move to step 3.
3. We follow the same approach as in the previous steps, but begin in the ESCAPED state.

If multiple valid rows are found from different starting states, we return the row at the lowest buffer position.
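
The three checks above can be sketched roughly as follows. This is a simplified illustration, not the DuckDB implementation: the state machine is a toy, and the names `scan_row` and `find_new_line`, the backslash escape character, and the validity rule (the row after the candidate newline must have `expected_columns` columns) are all assumptions made for this example.

```python
from enum import Enum


class State(Enum):
    STANDARD = 0
    QUOTED = 1
    ESCAPED = 2


def scan_row(buf, start, state, delim=",", quote='"', escape="\\"):
    """Scan from `start` in the given state until a row-ending newline.

    Returns (newline_position, column_count), or None if the buffer ends
    first. Toy state machine: an escape is only recognised inside quotes.
    """
    columns = 1
    for i in range(start, len(buf)):
        ch = buf[i]
        if state is State.ESCAPED:
            state = State.QUOTED  # the escaped character is taken literally
        elif state is State.QUOTED:
            if ch == escape:
                state = State.ESCAPED
            elif ch == quote:
                state = State.STANDARD
        elif ch == quote:
            state = State.QUOTED
        elif ch == delim:
            columns += 1
        elif ch == "\n":
            return i, columns
    return None


def find_new_line(buf, expected_columns):
    """Return the offset of the first character after the chosen newline."""
    first_newline = buf.find("\n")
    if first_newline == -1:
        return None  # the approach assumes at least one newline in the buffer
    candidates = []
    for start_state in (State.STANDARD, State.QUOTED, State.ESCAPED):
        hit = scan_row(buf, 0, start_state)
        if hit is None:
            continue
        newline_pos, _ = hit
        # Assumed validity rule: the following row has the expected width.
        row = scan_row(buf, newline_pos + 1, State.STANDARD)
        if row is None or row[1] != expected_columns:
            continue
        if newline_pos == first_newline:
            return newline_pos + 1  # early termination
        candidates.append(newline_pos + 1)
    return min(candidates) if candidates else None
```

With a buffer that starts inside a quoted value, such as `'x\ny",2,3\n4,5,6\n'`, the STANDARD attempt fails validation, and the QUOTED attempt finds the row that begins after the second newline.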

To enable proper early termination, we removed support for mixed newline delimiters (i.e., files with a mix of \r\n, \r, and \n delimiters).
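
To make "mixed newline delimiters" concrete, here is a minimal sketch that reports which newline styles occur in a piece of text. The helper names are hypothetical, the check ignores quoting (a newline inside a quoted value is counted like any other), and it is not how DuckDB detects this case.

```python
import re


def newline_styles(text):
    """Return the set of newline delimiters that occur in `text`."""
    # The pattern tries \r\n first, so a Windows line ending is not
    # counted as a separate \r followed by \n.
    return set(re.findall(r"\r\n|\r|\n", text))


def has_mixed_newlines(text):
    """True if more than one newline style appears in `text`."""
    return len(newline_styles(text)) > 1
```

For example, `has_mixed_newlines("a\nb\r\nc\r")` is true, while a file that uses only `\r\n` yields a single style.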

One thing to note is that this approach relies on the assumption that there is at least one newline in the buffer. I’m currently working to make this assumption more explicit by using the maximum_line_size option and making our buffer sizes dependent on it (when maximum_line_size > default_buffer_size).

@Mytherin Mytherin changed the base branch from main to feature October 8, 2024 11:59
@Mytherin Mytherin merged commit 336509e into duckdb:feature Oct 8, 2024
39 of 40 checks passed
Mytherin commented Oct 8, 2024

Thanks!

@pdet pdet deleted the find_new_line branch November 27, 2024 12:33