Efficient row by row join equivalent of pd.concat axis=1

I'm looking for a fast way to simply combine the columns of multiple tables/parquet files/pandas dataframes and I'm not sure I'm not overlooking something in duckdb.

Here's an example:
We simply have two tables `df1` and `df2` that we would like to write into a single table (the row indices don't matter, we guarantee that they have the same number of rows).

```python
import duckdb
import numpy as np
import pandas as pd
n, m = 10_000, 100
df1=pd.DataFrame(np.random.random((n, m)), columns=[f"df1_{i}" for i in range(m)]).reset_index()
df2=pd.DataFrame(np.random.random((n, m)), columns=[f"df2_{i}" for i in range(m)]).reset_index()
ddb = duckdb.connect()
```

Using python and pandas:

```python
In [42]: %timeit pd.concat([df1, df2], axis=1, join="inner")
28.7 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Using duckdb the way I came up with is a simple join:

```python
In [43]: %timeit ddb.execute("select * from df1 inner join df2 on df1.index=df2.index")
40.9 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

I'm assuming this is slower because we're potentially allowing for a lot more complex cases to be handled in the joining:
```
┌───────────────────────────┐                             
│         HASH_JOIN         │                             
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │                             
│            LEFT           ├──────────────┐              
│        index=index        │              │              
└─────────────┬─────────────┘              │                                           
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│        PANDAS_SCAN        ││        PANDAS_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           index           ││           index           │
│           df1_0           ││           df2_0           │
│           df1_1           ││           df2_1           │
```

Is there any way to simply combine the columns efficiently completely ignoring any sort of joins or column values?
Or is this basically the correct way to do this sort of simple combining of tables?


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Efficient row by row join equivalent of pd.concat axis=1 #3423

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Efficient row by row join equivalent of pd.concat axis=1 #3423

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions