petioptrv
Contributor

  • Tests added / passed
  • Passes black dask / flake8 dask

Implementation of #5069

from a numpy array if the dtype is homogeneous and is one of
integer, float or unsigned int.
2. `loc` now directly gets a `meta_nonempty` of the index.
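
The first item above is truncated in this copy, so the following is only a rough sketch of the general idea it seems to describe: when every column shares one integer, unsigned or float dtype, a small placeholder frame can be built from a single NumPy array instead of column by column. The helper name `nonempty_like` and its `nrows` parameter are hypothetical and are not taken from dask's code; the fallback path only handles NumPy dtypes and unique column names.

```python
import numpy as np
import pandas as pd


def nonempty_like(meta: pd.DataFrame, nrows: int = 2) -> pd.DataFrame:
    """Hypothetical helper: placeholder frame with meta's columns and dtypes."""
    dtypes = set(meta.dtypes)
    if len(dtypes) == 1:
        (dtype,) = dtypes
        # Fast path: one homogeneous int ("i"), unsigned ("u") or float ("f") dtype,
        # so a single 2-D array can back the whole frame.
        if isinstance(dtype, np.dtype) and dtype.kind in "iuf":
            block = np.zeros((nrows, meta.shape[1]), dtype=dtype)
            return pd.DataFrame(block, columns=meta.columns)
    # Fallback: build one small array per column (NumPy dtypes only).
    return pd.DataFrame(
        {col: np.zeros(nrows, dtype=dt) for col, dt in meta.dtypes.items()}
    )


# An empty frame with two float64 columns, standing in for a dask `_meta`.
meta = pd.DataFrame(
    {"a": pd.Series([], dtype="float64"), "b": pd.Series([], dtype="float64")}
)
placeholder = nonempty_like(meta)
print(placeholder.dtypes)
```

The fast path avoids a Python-level loop over columns, which is where a frame with thousands of columns would otherwise spend most of its time.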
@mrocklin
Member

mrocklin commented Nov 5, 2019

This looks good to me. Merging later today if there are no further comments.

@petioptrv if you have any interest, it would be nice to see the performance improvement that this provides. Do you have any interest in running the small example provided in the original issue both before and after this change and reporting the difference? This isn't required, but I suspect that people here would enjoy seeing the results of this work.

@mrocklin mrocklin merged commit 16eb97b into dask:master Nov 5, 2019
@mrocklin
Member

mrocklin commented Nov 5, 2019

Merging this in. Thanks @petioptrv !

Also, I notice that this is your first code contribution to this repository. Welcome!

@petioptrv
Contributor Author

Apologies for the late reply. I've been very busy at work lately. Thanks for the welcome @mrocklin! We started using Dask at my workplace recently, so I thought it would be the perfect opportunity to get a feel for open-source contributions.

As for the benchmark, here is the example from the original issue, with timings before and after the change:

import time

import pandas as pd
import dask.dataframe as dd

# Build a wide frame: 10,000 float columns with 10 rows each
data = {}
for i in range(10000):
    data["col" + str(i)] = [1.0] * 10
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=1)

# Time row selection with loc
t0 = time.time()
dloc = ddf.loc[0]
t1 = time.time()
print(t1 - t0)  # 2.118 s before vs 0.001 s after

# Time construction of the non-empty meta frame
t0 = time.time()
dmeta = ddf._meta_nonempty
t1 = time.time()
print(t1 - t0)  # 2.628 s before vs 0.984 s after
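
For anyone wanting to repeat the measurement, a single `time.time()` pair can be noisy. A minimal sketch of a repeat-and-take-the-minimum timer, using only the standard library, might look like the following. The name `best_of` and its `repeats` parameter are made up for illustration, and the workload shown is a stand-in rather than the dask calls above.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls of func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; any zero-argument callable works here.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} s")
```

With the objects from the snippet above, this could be called as `best_of(lambda: ddf.loc[0])`. If the measured operation caches its result, later repeats will be faster than the first, so `repeats=1` may be the fairer choice in that case.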
