Skip to content

Nondeterministic wrong answers, segfaults, and apparent memory corruptions when passing multiple empty column names to read_csv #14428

@wilcoxjay

Description

@wilcoxjay

What happens?

Consider the following call to read_csv:

from read_csv('bug.csv', header=false, names=['','']);

When run with any table in bug.csv with at least two rows and at least two columns, it sometimes segfaults, sometimes prints partial garbage, and sometimes returns wrong answers.

To Reproduce

For example, with the following CSV file, bug.csv:

a,b
c,d

running the query above sometimes yields:

$ duckdb
v1.1.2 f680b7d08f
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D from read_csv('bug.csv', header=false, names=['', '']);
┌─────────┬─────────┐
│   C0    │   C1    │
│ varchar │ varchar │
├─────────┼─────────┤
│ b       │         │
│ d       │         │
└─────────┴─────────┘

which is just wrong. Other times, it segfaults.

If we add a few more rows to the CSV:

a,b
c,d
e,f
g,h

then we sometimes get a similar wrong answer like this:

D from read_csv('bug.csv', header=false, names=['', '']);
┌─────────┬─────────┐
│   C0    │   C1    │
│ varchar │ varchar │
├─────────┼─────────┤
│ b       │         │
│ d       │         │
│ f       │         │
│ h       │         │
└─────────┴─────────┘

And sometimes we segfault.

But sometimes we also get exciting stuff like this:

D from read_csv('bug.csv', header=false, names=['', '']);
┌─────────┬─────────┐
│   C0    │   C1    │
│ varchar │ varchar │
├─────────┼─────────┤
│ b       │         │
│ d       │         │
│ f       │ \0\0    │
│ h       │ \0\0    │
└─────────┴─────────┘

or this:

D from read_csv('bug.csv', header=false, names=['', '']);  
┌─────────┬─────────┐
│   C0    │   C1    │                                                                                                                                                                                                                                                                                                                                            
│ varchar │ varchar │
├─────────┼─────────┤
│ b       │         │
│ d       │         │
│ f       │ f       │
│ h       │ h       │
└─────────┴─────────┘

OS:

macOS 13.7

DuckDB Version:

v1.1.2 f680b7d

DuckDB Client:

CLI

Hardware:

Apple M1

Full Name:

James Wilcox

Affiliation:

University of Washington

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions