Skip to content

Efficient row by row join equivalent of pd.concat axis=1 #3423

@daviskirk

Description

@daviskirk

I'm looking for a fast way to simply combine the columns of multiple tables/parquet files/pandas dataframes and I'm not sure I'm not overlooking something in duckdb.

Here's an example:
We simply have two tables df1 and df2 that we would like to write into a single table (the row indices don't matter, we guarantee that they have the same number of rows).

import duckdb
import numpy as np
import pandas as pd
n, m = 10_000, 100
df1=pd.DataFrame(np.random.random((n, m)), columns=[f"df1_{i}" for i in range(m)]).reset_index()
df2=pd.DataFrame(np.random.random((n, m)), columns=[f"df2_{i}" for i in range(m)]).reset_index()
ddb = duckdb.connect()

Using python and pandas:

In [42]: %timeit pd.concat([df1, df2], axis=1, join="inner")
28.7 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using duckdb the way I came up with is a simple join:

In [43]: %timeit ddb.execute("select * from df1 inner join df2 on df1.index=df2.index")
40.9 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I'm assuming this is slower because we're potentially allowing for a lot more complex cases to be handled in the joining:

┌───────────────────────────┐                             
│         HASH_JOIN         │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│            LEFT           ├──────────────┐              
│        index=index        │              │              
└─────────────┬─────────────┘              │                                           
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│        PANDAS_SCAN        ││        PANDAS_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           index           ││           index           │
│           df1_0           ││           df2_0           │
│           df1_1           ││           df2_1           │

Is there any way to simply combine the columns efficiently completely ignoring any sort of joins or column values?
Or is this basically the correct way to do this sort of simple combining of tables?

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions