
How can I call the pre-Spring 2024 CSV parser in v1.1.1? #14111


Description

@marklit

What happens?

When I wrote the 1.1B taxi rides benchmark (https://tech.marksblogg.com/duckdb-1b-taxi-rides.html#an-alternative-code-path), I was advised to call DuckDB's old CSV parser rather than the new one that was released this summer.

This worked fine on v1.0.0, but on v1.1.1 (af39bd0) the following raises errors every few hundred million records, reporting that 1 or 2 fewer columns than expected could be parsed.

$ ~/duckdb_111/duckdb \
    test.duckdb \
    < create.sql

$ for FILENAME in trips_*.csv.gz; do
    echo $FILENAME
    /usr/bin/time \
        ~/duckdb_111/duckdb \
        -c "COPY trips FROM '$FILENAME';" \
        test.duckdb
    du -hs test.duckdb
  done
trips_xae.csv.gz:

Expected Number of Columns: 51 Found: 49

trips_xag.csv.gz:

Expected Number of Columns: 51 Found: 50

trips_xbb.csv.gz:

Expected Number of Columns: 51 Found: 50
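
(Not part of the failing run, just a sketch of what I could try next: the CSV reader's null_padding and ignore_errors options relax the column-count check by padding short rows with NULLs or skipping unparseable rows. I'm assuming COPY accepts them the same way READ_CSV does, and I haven't verified this at the full 1.1B-row scale.)

$ for FILENAME in trips_*.csv.gz; do
    echo $FILENAME
    /usr/bin/time \
        ~/duckdb_111/duckdb \
        -c "COPY trips FROM '$FILENAME' (NULL_PADDING true, IGNORE_ERRORS true);" \
        test.duckdb
    du -hs test.duckdb
  done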

In v1.0.0, this was the way to call the new parser:

$ ~/duckdb_111/duckdb test.duckdb < create.sql

$ for FILENAME in trips_x*.csv.gz; do
    echo $FILENAME
    /usr/bin/time \
        ~/duckdb_111/duckdb \
        -c "INSERT INTO trips
                    SELECT *
                    FROM   READ_CSV('$FILENAME');" \
        test.duckdb
    du -hs test.duckdb
  done

That new parser still has the same sorts of inference issues in v1.1.1 as it did in v1.0.0.
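
(Again only a sketch I haven't benchmarked: the inference can be widened by passing sample_size=-1 so the sniffer looks at every row before picking types, or bypassed entirely with auto_detect=false plus an explicit columns={...} map for all 51 columns.)

$ for FILENAME in trips_x*.csv.gz; do
    echo $FILENAME
    /usr/bin/time \
        ~/duckdb_111/duckdb \
        -c "INSERT INTO trips
                    SELECT *
                    FROM   READ_CSV('$FILENAME',
                                    sample_size=-1);" \
        test.duckdb
    du -hs test.duckdb
  done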

The import times for the two CSV parsers are now almost identical which, together with the parsing issues, has me wondering: is there still a way to call the old CSV parser, which was able to import the 1.1B rows without issue, or has its code been removed from DuckDB v1.1.1?

To Reproduce

See above.

OS:

Ubuntu for Windows on Windows 11

DuckDB Version:

v1.1.1 af39bd0

DuckDB Client:

CLI

Hardware:

6 GHz Intel Core i9-14900K, 96 GB of RAM

Full Name:

Mark Litwintschik

Affiliation:

Green Idea

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
