Parallel CSV Reader #5194
Conversation
Awesome work @pdet !! This release is going to be awesome :)
So that's more than an order-of-magnitude improvement, impressive!
The single-threaded results are actually around 1200 seconds. The results on that page are from the previous DuckDB version, which loaded all ingested data uncompressed in memory in a single transaction before writing it out to disk, causing very poor performance when loading CSV files larger than memory, as is the case in the benchmark (see #4996 for the PR that fixed that).
Still a great improvement!
@pdet So this works with COPY from csv too?
Hi @hatvik! Please let me know if you encounter any issues :-)
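The reply implies COPY goes through the same CSV reader. For context, a minimal sketch of CSV ingestion via COPY (the table and file names are hypothetical):

```sql
-- Hypothetical schema; COPY ... FROM ingests through DuckDB's CSV reader.
CREATE TABLE lineitem (l_orderkey INTEGER, l_comment VARCHAR);
COPY lineitem FROM 'lineitem.csv' (DELIMITER '|', HEADER FALSE);
```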
Is it possible to load in parallel from multiple CSV files at once? I am currently using datasets that are partitioned into Zstd-compressed, fixed-row-count CSV chunks (since that makes them easy to process in parallel with arbitrary tools, e.g. just using GNU parallel).
Hello! Files passed in via either the glob syntax or a list of files should be read in parallel!
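A minimal sketch of both approaches (the file names are hypothetical):

```sql
-- Glob syntax: all matching files are read with the same schema.
SELECT * FROM read_csv_auto('data/chunk-*.csv');

-- Equivalent: pass an explicit list of files.
SELECT * FROM read_csv_auto(['data/chunk-000.csv', 'data/chunk-001.csv']);
```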
Experimental implementation of the Parallel CSV Reader.
We can currently parallelize cases where the quote, escape, and delimiter are each limited to a single character.
The parallel CSV reader also does not support newlines inside strings or the CSV sniffer (a schema must be provided up front).
Since the implementation is not yet integrated with the CSV sniffer, it is currently hidden behind an experimental flag.
By default, the CSV reader throws as many threads as possible at 32 MB chunks. The chunk size can be adjusted via the CSV read function, as sketched below.
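A minimal sketch, assuming the experimental flag is named `experimental_parallel_csv` and the chunk size is exposed through a `buffer_size` argument to `read_csv` (neither name is confirmed in this excerpt):

```sql
-- Enable the experimental parallel CSV reader (flag name is an assumption).
SET experimental_parallel_csv = true;

-- The parallel reader has no sniffer yet, so the schema is given explicitly;
-- buffer_size (an assumed parameter name) shrinks the chunk size to 16 MB.
SELECT *
FROM read_csv('lineitem.csv',
              delim = '|',
              columns = {'l_orderkey': 'INTEGER', 'l_comment': 'VARCHAR'},
              buffer_size = 16000000);
```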
For TPC-H SF1:

| Version | Threads | Time (s) |
| --- | --- | --- |
| Latest DuckDB (v0.5.1) | 1 | 2.93 |
| This PR | 1 | 2.95 |
| This PR | 4 | 0.87 |

~3.4x faster with 4 threads.