
Conversation

@pdet (Contributor) commented Nov 3, 2022

Experimental implementation of the Parallel CSV Reader.

We can currently parallelize cases where the quote, escape, and delimiter characters are limited to one character each.
The parallel CSV reader also doesn't support newlines inside strings or the CSV sniffer (a schema must be provided).

Since the implementation is not yet integrated with the CSV sniffer, it is currently hidden behind the setting:

SET experimental_parallel_csv=true;

By default, the CSV reader throws as many threads as possible at the work, with each thread operating on a 32 MB chunk. The chunk size can be adjusted via the buffer_size parameter of the CSV read function, e.g.,

SELECT sum(a) FROM read_csv('test/sql/copy/csv/data/test/multi_column_integer.csv',  COLUMNS=STRUCT_PACK(a := 'INTEGER', b := 'INTEGER', c := 'INTEGER'), auto_detect='false', delim = '|', buffer_size=5)
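For completeness, a minimal end-to-end sketch of enabling the reader and adjusting the chunk size (the file path and schema here are hypothetical, and buffer_size=33554432 assumes the parameter is in bytes, matching the 32 MB default):

SET experimental_parallel_csv=true;
SELECT sum(a)
FROM read_csv('my_data.csv',
              COLUMNS=STRUCT_PACK(a := 'INTEGER', b := 'INTEGER'),
              auto_detect='false',
              delim='|',
              buffer_size=33554432);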

For TPC-H SF1:

Latest DuckDB (0.5.1):
SF: 1, Threads: 1, Time: 2.9291566249448806

This PR:
SF: 1, Threads: 1, Time: 2.953550708014518
SF: 1, Threads: 4, Time: 0.8697507920442149

~3.4x faster with 4 threads.

@Mytherin merged commit 3261e00 into duckdb:master Nov 10, 2022
@lnkuiper (Contributor) commented:
Awesome work @pdet !! This release is going to be awesome :)

@pdet (Contributor, Author) commented Nov 10, 2022

Thanks @lnkuiper, yeah this release will be lit 🔥

We did some benchmarking yesterday, and we can now load the ClickBench CSV file in ~400s on the same machine as their official benchmark here.

@arjenpdevries (Contributor) commented:
So that's more than an order of magnitude improvement, impressive!

@Mytherin (Collaborator) commented:
The single-threaded results are actually around 1200 seconds. The results on that page are from the previous DuckDB version, which would load all ingested data uncompressed into memory in a single transaction prior to writing it out to disk, causing very poor performance when loading CSV files larger than memory, as is the case in that benchmark (see #4996 for the PR that fixed this).

@arjenpdevries (Contributor) commented:
Still a great improvement!

@hatvik (Contributor) commented Nov 17, 2022

@pdet So this works with COPY from CSV too?

@pdet (Contributor, Author) commented Nov 17, 2022

Hi @hatvik,
Yes, it should work; you do need to run SET experimental_parallel_csv=true; first.

Please let me know if you encounter any issues :-)
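For example, a minimal sketch of a COPY-based load (the table definition and file path are hypothetical; the schema must match the file, since the sniffer is not used):

SET experimental_parallel_csv=true;
CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);
COPY t FROM 'my_data.csv' (DELIMITER '|');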

@jaens commented Nov 24, 2022

Is it possible to load in parallel from multiple CSV files at once?

I am currently using datasets that are partitioned into Zstd-compressed, fixed-row-count CSV chunks (since that makes them easy to process in parallel with arbitrary tools, e.g., GNU parallel).

@Alex-Monahan (Contributor) commented:
Hello! Using either the glob syntax or passing in a list of files should be read in parallel!
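For instance, a sketch of both forms (the paths are hypothetical; read_csv_auto sniffs the schema, so until the parallel reader is integrated with the sniffer, the explicit-schema read_csv form shown earlier applies instead):

-- glob syntax
SELECT * FROM read_csv_auto('data/part-*.csv');
-- explicit list of files
SELECT * FROM read_csv_auto(['data/part-000.csv', 'data/part-001.csv']);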
