Description
What happens?
Reading multiple CSV files with experimental_parallel_csv = true
results in the error Error: Could not convert string 'row_id' to INT64 at line 60338 in column 0,
which I believe is due to the column header of the second file being read as data.
To Reproduce
Related to #5194
I compiled DuckDB with VS2019 from a GitHub checkout switched to branch v0.6.1.
I'm experimenting with the parallel CSV reader and globbing. test.txt (why doesn't GitHub support .sql files?) contains the sample query:
test.txt
Here are two sample data files, each with a header and 50,000 rows of data:
data.zip
Rename test.txt to test.sql and unzip data.zip so that you have all of the following in the same folder:
test.sql
0.csv
1.csv
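For reference, files with the same shape as 0.csv and 1.csv can be generated with a short script. The header names (row_id, integer, float, text), the continuous row_id across files, and the value patterns are my assumptions based on the error message and the query output below, not the exact contents of data.zip:

```python
# Hypothetical generator for files shaped like 0.csv / 1.csv:
# two CSVs, each with one header row and 50,000 data rows.
# Header names and value patterns are assumptions, not the real data.zip contents.
import csv

# Digit words seen in the first ten rows of the query output.
WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

ROWS_PER_FILE = 50_000

for file_no in range(2):                            # writes 0.csv and 1.csv
    with open(f"{file_no}.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["row_id", "integer", "float", "text"])
        for i in range(ROWS_PER_FILE):
            row_id = file_no * ROWS_PER_FILE + i    # continuous across files
            w.writerow([row_id, i, 1 + i / 10, WORDS[i % 10]])
```

With files like these, any run that fails to skip the second header will hit a non-numeric string ('row_id') in an INT64 column.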
Here is the output of a run with experimental_parallel_csv = false:
> duckdb csv.duckdb
v0.6.1 919cad22e8
Enter ".help" for usage hints.
D .read test.sql
*****************************************************
test sql to read in csv files
SQLite v0.6.1 919cad22e8
msvc-1929
Run Time (s): real 0.383 user 0.296875 sys 0.031250
┌─────────────┬─────────────┬─────────┬───────┬─────────┬───────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ int32 │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼───────┼─────────┼───────┤
│ my_row_id │ BIGINT │ YES │ │ │ │
│ integer │ INTEGER │ YES │ │ │ │
│ float │ DOUBLE │ YES │ │ │ │
│ text │ VARCHAR │ YES │ │ │ │
└─────────────┴─────────────┴─────────┴───────┴─────────┴───────┘
┌───────────┬─────────┬────────┬─────────┐
│ my_row_id │ integer │ float │ text │
│ int64 │ int32 │ double │ varchar │
├───────────┼─────────┼────────┼─────────┤
│ 0 │ 0 │ 1.0 │ zero │
│ 1 │ 1 │ 1.1 │ one │
│ 2 │ 2 │ 1.2 │ two │
│ 3 │ 3 │ 1.3 │ three │
│ 4 │ 4 │ 1.4 │ four │
│ 5 │ 5 │ 1.5 │ five │
│ 6 │ 6 │ 1.6 │ six │
│ 7 │ 7 │ 1.7 │ seven │
│ 8 │ 8 │ 1.8 │ eight │
│ 9 │ 9 │ 1.9 │ nine │
├───────────┴─────────┴────────┴─────────┤
│ 10 rows 4 columns │
└────────────────────────────────────────┘
┌────────┬─────────────┬────────┬────────┐
│ cnt │ sumInt │ minInt │ maxInt │
│ int64 │ int128 │ int32 │ int32 │
├────────┼─────────────┼────────┼────────┤
│ 500000 │ 12499750000 │ 0 │ 49999 │
└────────┴─────────────┴────────┴────────┘
┌──────────────┬───────────┬───────────┐
│ sumRow_Id │ minRow_Id │ maxRow_Id │
│ int128 │ int64 │ int64 │
├──────────────┼───────────┼───────────┤
│ 124999750000 │ 0 │ 499999 │
└──────────────┴───────────┴───────────┘
D
Here is the same run, but with experimental_parallel_csv = true:
D .read test.sql
*****************************************************
test sql to read in csv files
SQLite v0.6.1 919cad22e8
msvc-1929
Run Time (s): real 0.037 user 0.046875 sys 0.015625
Error: near line 12: Invalid Input Error: Could not convert string 'row_id' to INT64 at line 60338 in column 0. Parser options: DELIMITER=',', QUOTE='"' (default), ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=18446744073709549568, IGNORE_ERRORS=0, ALL_VARCHAR=0
.
.
.
It looks to me like DuckDB is trying to read the header of the second file as column data with experimental_parallel_csv = true,
but it works as you might expect/hope with experimental_parallel_csv = false.
Wondering if I'm missing something, or if the parallel reader is messing up how read_csv() works...
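A toy sketch of what I suspect is happening (this is my guess at the mechanism, not DuckDB's actual parallel reader): if, when globbing several files, the header is skipped only once rather than once per file, the second file's header row surfaces as data and fails the INT64 cast:

```python
# Toy model of the suspected bug: when globbing several CSVs, only the
# first file's header row is skipped, so later headers leak in as data.
# This is a guess at the mechanism, not DuckDB's implementation.

files = {
    "0.csv": "row_id,integer\n0,0\n1,1\n",
    "1.csv": "row_id,integer\n2,2\n3,3\n",
}

def read_glob_buggy(files):
    rows = []
    first = True
    for name in sorted(files):
        lines = files[name].splitlines()
        if first:                  # header skipped only for the first file
            lines = lines[1:]
            first = False
        for line in lines:
            a, b = line.split(",")
            rows.append((int(a), int(b)))   # int('row_id') raises ValueError
    return rows

try:
    read_glob_buggy(files)
except ValueError as e:
    print(e)  # mirrors "Could not convert string 'row_id' to INT64"
```

Skipping one line per file (rather than once overall) makes the toy reader succeed, which matches the non-parallel behavior above.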
OS:
Windows
DuckDB Version:
v0.6.1 919cad2
DuckDB Client:
command line client
Full Name:
Charlie
Affiliation:
none
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree