
Parallel CSV reader and file globbing #6074

@bucweat

Description

What happens?

Reading multiple CSV files with experimental_parallel_csv = true results in the error "Error: Could not convert string 'row_id' to INT64 at line 60338 in column 0.", which I believe is caused by the column header of the second file being read as data.

To Reproduce

Related to #5194

I compiled DuckDB with VS2019 from a GitHub checkout switched to the v0.6.1 branch.

I'm experimenting with the parallel CSV reader and globbing. test.txt (why doesn't github support .sql files?) contains the sample query:
test.txt
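
Since the attachment may not be convenient, here is roughly what the script does, reconstructed from the output below (a sketch, not the exact file; the explicit column list is my guess at why the table ends up with my_row_id while the error message suggests the files themselves use a row_id header):

.timer on
SET experimental_parallel_csv = true;  -- false for the comparison run

CREATE OR REPLACE TABLE t AS
SELECT * FROM read_csv('*.csv',
    header = true,
    columns = {'my_row_id': 'BIGINT', 'integer': 'INTEGER', 'float': 'DOUBLE', 'text': 'VARCHAR'});

DESCRIBE t;
SELECT * FROM t LIMIT 10;
SELECT count(*) AS cnt, sum("integer") AS sumInt, min("integer") AS minInt, max("integer") AS maxInt FROM t;
SELECT sum(my_row_id) AS sumRow_Id, min(my_row_id) AS minRow_Id, max(my_row_id) AS maxRow_Id FROM t;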

Here are two sample data files, each with a header and 50,000 rows of data:
data.zip
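
For reference, the first few lines of 0.csv look roughly like this (the row_id header name is inferred from the error message further down; the values follow the pattern visible in the LIMIT 10 output):

row_id,integer,float,text
0,0,1.0,zero
1,1,1.1,one
2,2,1.2,two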

Rename test.txt to test.sql and unzip data.zip so that you have the following all in the same folder:

test.sql
0.csv
1.csv

Here is the output of a run with experimental_parallel_csv = false:

> duckdb csv.duckdb
v0.6.1 919cad22e8
Enter ".help" for usage hints.
D .read test.sql
*****************************************************
test sql to read in csv files

SQLite v0.6.1 919cad22e8
msvc-1929
Run Time (s): real 0.383 user 0.296875 sys 0.031250
┌─────────────┬─────────────┬─────────┬───────┬─────────┬───────┐
│ column_name │ column_type │  null   │  key  │ default │ extra │
│   varchar   │   varchar   │ varchar │ int32 │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼───────┼─────────┼───────┤
│ my_row_id   │ BIGINT      │ YES     │       │         │       │
│ integer     │ INTEGER     │ YES     │       │         │       │
│ float       │ DOUBLE      │ YES     │       │         │       │
│ text        │ VARCHAR     │ YES     │       │         │       │
└─────────────┴─────────────┴─────────┴───────┴─────────┴───────┘
┌───────────┬─────────┬────────┬─────────┐
│ my_row_id │ integer │ float  │  text   │
│   int64   │  int32  │ double │ varchar │
├───────────┼─────────┼────────┼─────────┤
│         0 │       0 │    1.0 │ zero    │
│         1 │       1 │    1.1 │ one     │
│         2 │       2 │    1.2 │ two     │
│         3 │       3 │    1.3 │ three   │
│         4 │       4 │    1.4 │ four    │
│         5 │       5 │    1.5 │ five    │
│         6 │       6 │    1.6 │ six     │
│         7 │       7 │    1.7 │ seven   │
│         8 │       8 │    1.8 │ eight   │
│         9 │       9 │    1.9 │ nine    │
├───────────┴─────────┴────────┴─────────┤
│ 10 rows                      4 columns │
└────────────────────────────────────────┘
┌────────┬─────────────┬────────┬────────┐
│  cnt   │   sumInt    │ minInt │ maxInt │
│ int64  │   int128    │ int32  │ int32  │
├────────┼─────────────┼────────┼────────┤
│ 500000 │ 12499750000 │      0 │  49999 │
└────────┴─────────────┴────────┴────────┘
┌──────────────┬───────────┬───────────┐
│  sumRow_Id   │ minRow_Id │ maxRow_Id │
│    int128    │   int64   │   int64   │
├──────────────┼───────────┼───────────┤
│ 124999750000 │         0 │    499999 │
└──────────────┴───────────┴───────────┘
D

Here is the same run but with experimental_parallel_csv = true:

D .read test.sql
*****************************************************
test sql to read in csv files

SQLite v0.6.1 919cad22e8
msvc-1929
Run Time (s): real 0.037 user 0.046875 sys 0.015625
Error: near line 12: Invalid Input Error: Could not convert string 'row_id' to INT64 at line 60338 in column 0. Parser options: DELIMITER=',', QUOTE='"' (default), ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=18446744073709549568, IGNORE_ERRORS=0, ALL_VARCHAR=0
.
.
.

It looks to me like DuckDB tries to read the header of the second file as column data when experimental_parallel_csv = true, but works as you might expect/hope for with experimental_parallel_csv = false. Am I missing something, or is the parallel reader interfering with how read_csv() works?
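
If it helps narrow things down, I would expect even a minimal glob read to hit the same error (an untested sketch using read_csv_auto rather than the explicit column list):

SET experimental_parallel_csv = true;
-- any multi-file glob where each file has its own header row should do
SELECT count(*) FROM read_csv_auto('*.csv');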

OS:

Windows

DuckDB Version:

v0.6.1 919cad2

DuckDB Client:

command line client

Full Name:

Charlie

Affiliation:

none

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
