Skip to content

Conversation

pdet
Copy link
Contributor

@pdet pdet commented Feb 7, 2023

Restarting some buffer variables properly when passing from one file to the other.

Fix: #6074

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM

@Mytherin
Copy link
Collaborator

Mytherin commented Feb 8, 2023

Looks like there is a test failing because of a missing sort-order (and globs can be read in any order - it is file-system dependent). Could you add a rowsort or change the glob to a list of files?

@Mytherin
Copy link
Collaborator

Mytherin commented Feb 8, 2023

[1/1] (100%): test/sql/copy/csv/parallel/test_multiple_files.test               
===============================================================================
All tests passed (3089 assertions in 1 test case)


--------------------
STDERR
--------------------

==================
WARNING: ThreadSanitizer: data race (pid=15275)

@Mytherin Mytherin merged commit ce70772 into duckdb:master Feb 10, 2023
@Mytherin
Copy link
Collaborator

Thanks!

@bucweat
Copy link
Contributor

bucweat commented Feb 10, 2023

Hi,

Saw the updates here and that #6074 was closed. I checked out head, rebuilt, ran unittest.exe with all tests passing. I then tried my example from that issue and still seeing issue, though with better error message :-)

duckdb csv.duckdb
v0.6.2-dev2340 c0907528a3
Enter ".help" for usage hints.
D .read test.sql
*****************************************************
test sql to read in csv files

SQLite v0.6.2-dev2340 c0907528a3
msvc-1929
Run Time (s): real 0.040 user 0.031250 sys 0.000000
Error: near line 12: Invalid Input Error: Could not convert string 'row_id' to INT64 at line 60338 in column 0. Parser options:
  file=0.csv
  delimiter=','
  quote='"' (default)
  escape='"' (default)
  header=1
  sample_size=18446744073709549568
  ignore_errors=0
  all_varchar=0

Sorry if I jumped the gun on giving this a try...

@bucweat
Copy link
Contributor

bucweat commented Feb 12, 2023

Hi,

@pdet I spent a little more time looking at the fix and saw my test code became a test case test/sql/copy/csv/parallel/test_multiple_files.test :-)

I took a look at the file and the two tests there both use read_csv_auto() and that does seem to work both parallel and not. However, the issue I was seeing was with read_csv() which is what caused the error in my previous comment. I updated the test file adding:

select * from read_csv('test/sql/copy/csv/data/auto/glob/[0-9].csv', AUTO_DETECT=true)

and that worked. I then added something closer to what failed in previous comment:

`select * from read_csv('test/sql/copy/csv/data/auto/glob/[0-9].csv', sample_size=-1, header=True, columns={'row_id':'BIGINT','integer':'INTEGER','float':'DOUBLE', 'text':'VARCHAR'})'

and that still fails. If I SET experimental_parallel_csv=false then that works just fine. So given it works with parallel turned off I'm assuming the command is valid.

I've attached a modified version of test_multiple_files.test with .txt extension added so that I could attach it. It adds a couple of tests as described above and also adds a foreach loop to cycle experimental_parallel_csv first to false where everything works, and then to true where one test fails.

test_multiple_files.test.txt

Hopefully that is helpful...

@Mytherin
Copy link
Collaborator

Could you perhaps open a new issue or re-open your existing issue? It is easy to lose track of these reports otherwise.

@bucweat
Copy link
Contributor

bucweat commented Feb 12, 2023

Yeah I get that for sure....problem is I'm not sure how to re-open #6074 again as I'm not seeing any options on how to do that. I was hoping that mentioning it (@pdet mentioned it at the beginning of this pull request too) would provide enough breadcrumbs... Anyway, I'll look a little harder and if I don't have any luck will submit something new.

@Mytherin
Copy link
Collaborator

I have reopened it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallel CSV reader and file globbing
3 participants