Closed
Labels
dataframe, good first issue
Description
Suppose I have a dask dataframe with lots of columns, all of them of float dtype. `_meta_nonempty` creates each column separately and then makes a dataframe out of them. This makes some operations surprisingly slow, such as `loc`. I see some issues with this:

- `_meta_nonempty` can be optimized for some dataframes.
- `loc` shouldn't require `_meta_nonempty` in the first place. It only needs `self.obj._meta_nonempty.index`, which doesn't require recreating all these columns.
```python
import pandas as pd
import dask.dataframe as dd

# Build a wide dataframe: 10,000 float columns, 10 rows each
data = {}
for i in range(10000):
    data["col" + str(i)] = [1.0] * 10
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=1)

# any loc on ddf takes several seconds because of slow _meta_nonempty construction
%time dloc = ddf.loc[0]  # 2.82s
%time dmeta = ddf._meta_nonempty  # 2.87s

# loc works near-instantly on the original dataframe
%time loc = df.loc[0]  # 1.69ms
```
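To illustrate the first point, here is a minimal sketch of what a fast path for homogeneous float dataframes could look like. It uses only pandas and NumPy and is not dask's actual implementation: `per_column_nonempty` and `block_nonempty` are hypothetical names, and the two-row placeholder filled with `1.0` is an assumption made for illustration. The idea is to allocate one 2D array for all columns instead of constructing each column separately.

```python
import numpy as np
import pandas as pd


def per_column_nonempty(meta: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the slow pattern: one Series per column."""
    fake_index = pd.RangeIndex(2)
    columns = {
        name: pd.Series(np.ones(2, dtype=dtype), index=fake_index)
        for name, dtype in meta.dtypes.items()
    }
    return pd.DataFrame(columns, index=fake_index)


def block_nonempty(meta: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical fast path: if every column shares one NumPy float dtype,
    build the placeholder frame from a single 2D array."""
    unique_dtypes = set(meta.dtypes)
    if len(unique_dtypes) == 1:
        (dtype,) = unique_dtypes
        if isinstance(dtype, np.dtype) and dtype.kind == "f":
            values = np.ones((2, len(meta.columns)), dtype=dtype)
            return pd.DataFrame(values, columns=meta.columns, index=pd.RangeIndex(2))
    # Mixed or non-float dtypes: fall back to the per-column construction
    return per_column_nonempty(meta)


# Empty "meta" frame with many float64 columns, similar in shape to the example above
meta = pd.DataFrame(columns=["col" + str(i) for i in range(10000)], dtype="float64")
placeholder = block_nonempty(meta)
```

Both helpers should produce the same placeholder frame for all-float input; the second one just avoids creating thousands of intermediate Series objects. Whether a shortcut like this fits into `_meta_nonempty` depends on details of dask's metadata handling that this sketch does not cover.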