Description
I'm looking for a fast way to simply combine the columns of multiple tables/Parquet files/pandas DataFrames, and I'm not sure whether I'm overlooking something in DuckDB.
Here's an example: we have two tables, df1 and df2, that we would like to write into a single table (the row indices don't matter; we guarantee that both have the same number of rows).
import duckdb
import numpy as np
import pandas as pd
n, m = 10_000, 100
df1 = pd.DataFrame(np.random.random((n, m)), columns=[f"df1_{i}" for i in range(m)]).reset_index()
df2 = pd.DataFrame(np.random.random((n, m)), columns=[f"df2_{i}" for i in range(m)]).reset_index()
ddb = duckdb.connect()
Using Python and pandas:
In [42]: %timeit pd.concat([df1, df2], axis=1, join="inner")
28.7 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
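Since both frames are guaranteed to have the same number of rows, a useful baseline (my own hypothetical variation, not part of the original timings) is to bypass index alignment entirely and stack the underlying NumPy arrays side by side:

```python
import numpy as np
import pandas as pd

n, m = 1000, 10
df1 = pd.DataFrame(np.random.random((n, m)), columns=[f"df1_{i}" for i in range(m)])
df2 = pd.DataFrame(np.random.random((n, m)), columns=[f"df2_{i}" for i in range(m)])

# Stack the raw arrays column-wise; valid only because both frames
# are known to have the same number of rows and aligned ordering.
combined = pd.DataFrame(
    np.hstack([df1.to_numpy(), df2.to_numpy()]),
    columns=list(df1.columns) + list(df2.columns),
)
```

This skips any per-row matching, so it only applies when positional alignment is acceptable.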
Using DuckDB, the approach I came up with is a simple join:
In [43]: %timeit ddb.execute("select * from df1 inner join df2 on df1.index=df2.index")
40.9 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'm assuming this is slower because the join has to allow for much more general cases than a simple positional concatenation of columns:
┌───────────────────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ LEFT ├──────────────┐
│ index=index │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PANDAS_SCAN ││ PANDAS_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ index ││ index │
│ df1_0 ││ df2_0 │
│ df1_1 ││ df2_1 │
Is there any way to combine the columns efficiently, completely ignoring joins and column values?
Or is a join basically the correct way to do this sort of simple combination of tables?