Dataframe.loc is slow when there's a lot of columns #5069

@tshatrov

Suppose I have a dask dataframe with many columns, all of float dtype. _meta_nonempty builds each column separately and then assembles a dataframe from them. This makes some operations, such as loc, surprisingly slow. I see two issues with this:

  1. _meta_nonempty can be optimized for some dataframes (for example, when all columns share a dtype, as here).
  2. loc shouldn't require the full _meta_nonempty in the first place. It only needs self.obj._meta_nonempty.index, which doesn't require recreating all these columns.
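
As a rough illustration of the second point, here is a pandas-only sketch (not dask's actual code). `make_nonempty` is a hypothetical stand-in for a non-empty-meta builder, and the two-row `RangeIndex` is an assumption made for the example. Dropping all columns before calling it yields the same index without building any column:

```python
import numpy as np
import pandas as pd


def make_nonempty(meta):
    """Hypothetical stand-in: build two fake rows, one column at a time."""
    data = {
        name: pd.Series(np.ones(2), dtype=dtype)
        for name, dtype in meta.dtypes.items()
    }
    return pd.DataFrame(data, index=pd.RangeIndex(2), columns=meta.columns)


# Empty metadata frame with many float columns (stand-in for a dask _meta)
meta = pd.DataFrame({f"col{i}": [1.0] for i in range(1000)}).iloc[:0]

full = make_nonempty(meta)            # builds all 1000 columns
index_only = make_nonempty(meta[[]])  # zero columns selected, so none are built

# Both results carry the same index, which is all an index lookup needs
print(index_only.index.equals(full.index))
```

Whether dask's loc can take this shortcut in every code path is not something this sketch establishes; it only shows that the index does not depend on the columns.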

To reproduce (in IPython, for the %time magic):

import pandas as pd
import dask.dataframe as dd

# 10,000 float columns, 10 rows each
data = {f"col{i}": [1.0] * 10 for i in range(10000)}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=1)

# Any loc on ddf takes several seconds because _meta_nonempty construction is slow
%time dloc = ddf.loc[0]            # 2.82 s
%time dmeta = ddf._meta_nonempty   # 2.87 s

# loc is near-instant on the original pandas dataframe
%time loc = df.loc[0]              # 1.69 ms
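
One possible shape for the first optimization, sketched with pandas and NumPy only. Both function names are hypothetical and this is not dask's implementation: when every column shares a single NumPy float dtype, the placeholder rows can be filled from one 2D array instead of one Series per column, with a per-column fallback otherwise.

```python
import numpy as np
import pandas as pd


def nonempty_per_column(meta):
    """Per-column construction (the slow pattern described above).

    Illustrative only: assumes numeric columns.
    """
    data = {
        name: pd.Series(np.ones(2), dtype=dtype)
        for name, dtype in meta.dtypes.items()
    }
    return pd.DataFrame(data, columns=meta.columns)


def nonempty_block(meta):
    """Hypothetical fast path for frames whose columns share one float dtype."""
    dtypes = meta.dtypes.unique()
    if (
        len(dtypes) == 1
        and isinstance(dtypes[0], np.dtype)
        and np.issubdtype(dtypes[0], np.floating)
    ):
        # One allocation for all columns instead of one Series per column
        values = np.ones((2, len(meta.columns)), dtype=dtypes[0])
        return pd.DataFrame(values, columns=meta.columns)
    return nonempty_per_column(meta)  # mixed or non-float dtypes: fall back


meta = pd.DataFrame({f"col{i}": [1.0] for i in range(2000)}).iloc[:0]
slow = nonempty_per_column(meta)
fast = nonempty_block(meta)
print(fast.equals(slow))
```

The two functions produce equal frames for the all-float case, so the fast path could be swapped in without changing what callers see. Actual timings will depend on the pandas version and are not claimed here.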

Labels: dataframe, good first issue
