-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Closed
Labels
Description
What happens?
Running UNNEST function over a single row table containing a struct with 350 keys is taking more than 5 seconds on my machine. When I analyze how UNNEST function behaves, I see that the time it takes for the function to run increases exponentially with the number of keys in the data. I also added a comparison to simulate the same behavior by extracting the data, unnesting in python ingesting back on duckdb and reading it back. I would expect for both functions to behave similarly but I see a very large divergence. You can see below the comparison between these two methods of unnest.
To Reproduce
import duckdb
import pyarrow as pa
import time
from matplotlib import pyplot as plt
import timeit
from functools import partial
def get_table(col_size):
"Creates a one row arrow table with the given column size"
row = {f"col_{idx}": f"my_value_{idx}" for idx in range(col_size)}
return pa.Table.from_pydict({"data": pa.array([row])})
def run_unnest(col_size):
"Run unnest natively"
with duckdb.connect() as con:
con.register("my_view", get_table(col_size))
start_time = time.time()
con.sql("select unnest(data) from my_view").fetchall()
end_time = time.time()
return end_time-start_time
def run_python_unnest(col_size):
"Read the data unnest and put it back, then read again"
with duckdb.connect() as con:
con.register("my_view", get_table(col_size))
start_time = time.time()
data = con.sql("select data from my_view").fetchall()
con.register("my_new_view", pa.Table.from_pylist([k[0] for k in data]))
con.execute("select * from my_new_view").fetchall()
end_time = time.time()
return end_time - start_time
def time_function(fn, col_size):
"Take an average of a 5 run. Return milliseconds"
return timeit.timeit(stmt=partial(fn, col_size=col_size), number=5)/5*1000
x = list(range(1,402,20))
y1 = [time_function(run_unnest, col_size=col_size) for col_size in x]
y2 = [time_function(run_python_unnest, col_size=col_size) for col_size in x]
plt.plot(x, y1, "r--", label="duckdb unnest")
plt.plot(x, y2, "b-", label="python unnest")
plt.legend(loc="upper left")
plt.xlabel("Number of columns")
plt.ylabel("Milliseconds to run")
plt.show()
OS:
MacOS 13.2.1 aarch64
DuckDB Version:
0.9.1
DuckDB Client:
Python
Full Name:
Izel Levy
Affiliation:
Explorium
Have you tried this on the latest main
branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have