DuckDB only uses a single thread if the first Parquet file is empty. #10112

@yiyuanliu

What happens?

DuckDB only uses a single thread if the first Parquet file in the file list is empty.

To Reproduce

Generate multiple test Parquet files.

create table tbl as from generate_series(100000000);
copy tbl to 'test-parquet/' (format parquet, per_thread_output true);

Reading these files is very fast with multiple threads.

explain analyze select * from 'test-parquet/data*.parquet';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 2.76s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.15s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       PARQUET_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│      generate_series      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│       EC: 109322272       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         100000001         │
│          (57.97s)         │
└───────────────────────────┘

Generate an empty Parquet file and put it first in the file list. DuckDB then uses only a single thread for the whole read.
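The report does not show how the empty file was created. One possible way, sketched here as an assumption rather than the reporter's actual command, is to copy a zero-row selection of `tbl` so the file keeps the same `generate_series` column:

```sql
-- Assumption: the empty file shares the schema of tbl and holds zero rows.
-- The path matches the one used in the read_parquet call in this report.
copy (select * from tbl limit 0) to 'test-parquet/empty.parquet' (format parquet);

-- Sanity check: this should report zero rows for the new file.
select count(*) from 'test-parquet/empty.parquet';
```

Because the name `empty.parquet` does not match the `data*.parquet` glob, the fast query above is unaffected by this extra file.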

explain analyze select * from read_parquet(['test-parquet/empty.parquet', 'test-parquet/data*.parquet']);
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 52.08s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.12s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│      generate_series      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 0           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         100000001         │
│          (34.59s)         │
└───────────────────────────┘

OS:

Ubuntu 22.04

DuckDB Version:

v0.9.3-dev1411 7d5150c

DuckDB Client:

CLI

Full Name:

Yiyuan Liu

Affiliation:

High-Flyer AI

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
