
Conversation

@pdet (Contributor) commented Nov 3, 2022

Experimental implementation of the Parallel CSV Reader.

We can currently parallelize cases where the quote, escape, and delimiter characters are limited to one character each.
The parallel CSV reader also doesn't support newlines inside strings or the CSV sniffer (a schema must be provided).

Since the implementation is not yet integrated with the CSV sniffer, it is currently hidden behind the setting:

SET experimental_parallel_csv=true;

By default, the CSV reader throws as many threads as possible at the work, with each thread operating on a 32 MB chunk. The chunk size can be adjusted via the buffer_size parameter of the CSV read function, e.g.,

SELECT sum(a) FROM read_csv('test/sql/copy/csv/data/test/multi_column_integer.csv',  COLUMNS=STRUCT_PACK(a := 'INTEGER', b := 'INTEGER', c := 'INTEGER'), auto_detect='false', delim = '|', buffer_size=5)
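For completeness, a minimal end-to-end sketch of enabling the reader and adjusting the chunk size (the file path and schema here are hypothetical, and buffer_size=33554432 assumes the parameter is in bytes, matching the 32 MB default):

SET experimental_parallel_csv=true;
SELECT sum(a)
FROM read_csv('my_data.csv',
              COLUMNS=STRUCT_PACK(a := 'INTEGER', b := 'INTEGER'),
              auto_detect='false',
              delim='|',
              buffer_size=33554432);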

For TPC-H SF1:

Latest DuckDB (0.5.1):
SF: 1, Threads: 1, Time: 2.9291566249448806

This PR:
SF: 1, Threads: 1, Time: 2.953550708014518
SF: 1, Threads: 4, Time: 0.8697507920442149

~3.4x faster with 4 threads.

@Mytherin merged commit 3261e00 into duckdb:master Nov 10, 2022
@lnkuiper (Contributor) commented:
Awesome work @pdet !! This release is going to be awesome :)

@pdet (Contributor, Author) commented Nov 10, 2022

Thanks @lnkuiper, yeah this release will be lit 🔥

We did some benchmarking yesterday, and we can now load the ClickBench CSV file in ~400s on the same machine as their official benchmark here.

@arjenpdevries (Contributor) commented:
So that's more than an order of magnitude improvement, impressive!

@Mytherin (Collaborator) commented:
The single-threaded results are actually around 1200 seconds. The results on that page are from the previous DuckDB version, which would load all ingested data uncompressed into memory in a single transaction prior to writing it out to disk, causing very poor performance when loading CSV files larger than memory, as is the case in that benchmark (see #4996 for the PR that fixed this).

@arjenpdevries (Contributor) commented:
Still a great improvement!

@hatvik (Contributor) commented Nov 17, 2022

@pdet So this works with COPY from CSV too?

@pdet (Contributor, Author) commented Nov 17, 2022

Hi @hatvik,
Yes, it should work; you do need to run SET experimental_parallel_csv=true; first.

Please let me know if you encounter any issues :-)
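For example, a minimal sketch of a COPY-based load (the table definition and file path are hypothetical; the schema must match the file, since the sniffer is not used):

SET experimental_parallel_csv=true;
CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);
COPY t FROM 'my_data.csv' (DELIMITER '|');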

@jaens commented Nov 24, 2022

Is it possible to load in parallel from multiple CSV files at once?

I am currently using datasets that are partitioned into Zstd-compressed, fixed-row-count CSV chunks (since that makes them easy to process in parallel with arbitrary tools, e.g., GNU parallel).

@Alex-Monahan (Contributor) commented:
Hello! Using either the glob syntax or passing in a list of files should be read in parallel!
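For instance, a sketch of both forms (the paths are hypothetical; read_csv_auto sniffs the schema, so until the parallel reader is integrated with the sniffer, the explicit-schema read_csv form shown earlier applies instead):

-- glob syntax
SELECT * FROM read_csv_auto('data/part-*.csv');
-- explicit list of files
SELECT * FROM read_csv_auto(['data/part-000.csv', 'data/part-001.csv']);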
