Skip to content

UNNEST is very slow on a struct with high number of keys over arrow data #9628

@izellevy

Description

@izellevy

What happens?

Running UNNEST function over a single row table containing a struct with 350 keys is taking more than 5 seconds on my machine. When I analyze how UNNEST function behaves, I see that the time it takes for the function to run increases exponentially with the number of keys in the data. I also added a comparison to simulate the same behavior by extracting the data, unnesting in python ingesting back on duckdb and reading it back. I would expect for both functions to behave similarly but I see a very large divergence. You can see below the comparison between these two methods of unnest.

Screenshot 2023-11-09 at 18 01 46

To Reproduce

import duckdb
import pyarrow as pa
import time
from matplotlib import pyplot as plt
import timeit
from functools import partial

def get_table(col_size):
    "Creates a one row arrow table with the given column size"
    row = {f"col_{idx}": f"my_value_{idx}" for idx in range(col_size)}
    return pa.Table.from_pydict({"data": pa.array([row])})

def run_unnest(col_size):    
    "Run unnest natively"
    with duckdb.connect() as con:
        con.register("my_view", get_table(col_size))
        start_time = time.time()
        con.sql("select unnest(data) from my_view").fetchall()
        end_time = time.time()
    return end_time-start_time

def run_python_unnest(col_size):
    "Read the data unnest and put it back, then read again"
    with duckdb.connect() as con:
        con.register("my_view", get_table(col_size))
        start_time = time.time()
        data = con.sql("select data from my_view").fetchall()
        con.register("my_new_view", pa.Table.from_pylist([k[0] for k in data]))
        con.execute("select * from my_new_view").fetchall()
        end_time = time.time()
    return end_time - start_time

def time_function(fn, col_size):
    "Take an average of a 5 run. Return milliseconds"
    return timeit.timeit(stmt=partial(fn, col_size=col_size), number=5)/5*1000

x = list(range(1,402,20))
y1 = [time_function(run_unnest, col_size=col_size) for col_size in x]
y2 = [time_function(run_python_unnest, col_size=col_size) for col_size in x]

plt.plot(x, y1, "r--", label="duckdb unnest")
plt.plot(x, y2, "b-", label="python unnest")
plt.legend(loc="upper left")
plt.xlabel("Number of columns")
plt.ylabel("Milliseconds to run")
plt.show()

OS:

MacOS 13.2.1 aarch64

DuckDB Version:

0.9.1

DuckDB Client:

Python

Full Name:

Izel Levy

Affiliation:

Explorium

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions