Conversation

GaelVaroquaux
Member

A simple dask.distributed example

@GaelVaroquaux
Member Author

For discussion and early review. Cc @TomAugspurger

@codecov

codecov bot commented Jan 26, 2018

Codecov Report

Merging #613 into master will decrease coverage by 0.09%.
The diff coverage is n/a.


@@           Coverage Diff            @@
##           master    #613     +/-   ##
========================================
- Coverage   95.29%   95.2%   -0.1%     
========================================
  Files          39      39             
  Lines        5462    5462             
========================================
- Hits         5205    5200      -5     
- Misses        257     262      +5
Impacted Files Coverage Δ
joblib/_parallel_backends.py 95.25% <0%> (-1.3%) ⬇️
joblib/test/test_memory.py 97.8% <0%> (-0.37%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d52110...d110ba5. Read the comment docs.

return i

with joblib.parallel_backend('dask.distributed',
                             scheduler_host=client.scheduler.address):


client.scheduler.address isn't always available, depending on the type of cluster (it is available for the LocalCluster created automatically by Client(), but not necessarily for others)

I believe the recommended way would be

address = client.scheduler_info()['address']

with joblib.parallel_backend('dask.distributed',
                             scheduler_host=address): 
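
For illustration, a minimal, self-contained sketch of that pattern, assuming the pre-release versions discussed in this thread (where importing distributed.joblib registers the backend):

import joblib
import distributed.joblib  # noqa: registers the 'dask.distributed' backend
from distributed import Client

client = Client()  # starts a LocalCluster on this machine
address = client.scheduler_info()['address']

with joblib.parallel_backend('dask.distributed', scheduler_host=address):
    results = joblib.Parallel(verbose=10)(
        joblib.delayed(abs)(i) for i in range(10))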

Contributor


client.scheduler.address is always available; in this case scheduler is actually the connection object, not the scheduler itself. In either event, though, you don't need either.

Also, if relevant, I hope to release dask.distributed within a week.


Realistic usage scenario: combining dask code with joblib code, for
instance using dask for preprocessing data, and scikit-learn for machine
learning.


Would you consider "prototyping a solution, to later be run on a truly distributed cluster" a "realistic usage scenario"?

That (prototyping, before moving to a cluster) and the diagnostics dashboard are my two most common use cases for the distributed scheduler on a single machine.

@GaelVaroquaux GaelVaroquaux changed the title [WIP] Add a dask.distributed example [MRG] Add a dask.distributed example Jan 26, 2018
@@ -0,0 +1,57 @@
"""
Using distributed for single_machine parallel computing
Contributor


single_machine -> single machine

Contributor


FWIW we're trying to avoid referring to the code in the github.com/dask/distributed repository as distributed. The reason here is that it's a fairly generic term. Instead I might recommend just using the term dask here, or, if preferred, dask.distributed.

Member Author


I'll use dask.distributed.

This naming thing is quite confusing (I can't blame you, it's an error that I've run into over and over, first with enthought.mayavi => mayavi, and later with scikits.learn => scikit-learn (imported as sklearn)). However, it will confuse users, and even myself, as it makes the difference and the boundary between the projects quite blurry.

import distributed.joblib # noqa

###############################################################################
# Run parallel computation using dask.distributed
Contributor


we could add dask to intersphinx to link to their documentation?

Member Author


Yey! Great idea!

Contributor


I would love to see this happen. I have no experience with intersphinx myself, but I have seen it in use more often recently and have liked what I've seen.
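
For reference, a hypothetical snippet for doc/conf.py; the exact inventory URL of the dask.distributed documentation is an assumption here:

extensions = [
    # ... existing extensions ...
    'sphinx.ext.intersphinx',
]
intersphinx_mapping = {
    'distributed': ('https://distributed.readthedocs.io/en/latest/', None),
}

With this in place, references such as :class:`distributed.Client` can link to the upstream documentation, provided that project publishes an objects.inv inventory.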

return i


with joblib.parallel_backend('dask.distributed', scheduler_host=address):
Contributor


I would even add backend='dask.distributed' or add a small discussion after the title (it seems a bit empty there).

Contributor


But I am not 100% sure.

Contributor


This should suffice if you have already created a client (requires master). You also don't need the address = line above

with joblib.parallel_backend('dask'):
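
In full, the suggested form would look roughly like this (a sketch, assuming joblib master at the time, where a previously created Client is picked up automatically):

from distributed import Client
client = Client()  # the most recently created Client is used by the backend

with joblib.parallel_backend('dask'):
    joblib.Parallel(verbose=100)(
        joblib.delayed(long_running_function)(i) for i in range(10))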

Member Author


Now backend='dask' is very confusing, because, as @TomAugspurger was explaining to me, by default dask doesn't use the distributed backend, but a threading one.

I'll wait for a release of distributed to update this example, as I would like it to run with a released version.

Contributor


Right, but you'll never use this with the legacy threaded scheduler; it isn't sufficiently flexible to handle Joblib's dynamism. There is only one relevant use case for Dask here, and it's the newer scheduler.

@@ -16,7 +16,7 @@ conda update --yes --quiet conda
conda create -n $CONDA_ENV_NAME --yes --quiet python=3
source activate $CONDA_ENV_NAME

-conda install --yes --quiet pip numpy sphinx matplotlib pillow
+conda install --yes --quiet pip numpy sphinx matplotlib pillow dask distributed
Contributor


Just dask suffices here. The core package is now called dask-core while dask is a metapackage that includes distributed and a few other packages (like numpy, pandas, ...)

###############################################################################
# Setup the distributed client
###############################################################################
from distributed import Client
Contributor


Similarly we tend to encourage from dask.distributed import Client in examples

Member Author


This is terribly confusing, you realize. I haven't looked at the codebases of the packages, but my mental model right now is a bit lost as to what package does what.

Contributor


It's not that confusing if you're coming to it fresh (which most people are). It's really only a pain for the old hands who were around when we called the thing distributed on its own.

# Recover the address
address = client.scheduler_info()['address']

# This import registers the dask backend for joblib
Contributor


imports and registers

Contributor


s/dask/dask.distributed/

Member Author


I think that that was correct: import is used as a noun here.

joblib.Parallel(n_jobs=2, verbose=100)(
    joblib.delayed(long_running_function)(i)
    for i in range(10))
# We can check that joblib is indeed using the dask.distributed
Contributor


reword without we

@GaelVaroquaux
Member Author

I can't get the intersphinx mapping to work. Is distributed.Client documented in what is captured by intersphinx?

@GaelVaroquaux
Member Author

GaelVaroquaux commented Jan 26, 2018 via email

@GaelVaroquaux
Member Author

GaelVaroquaux commented Jan 26, 2018 via email

@mrocklin
Contributor

> OK, and the threaded scheduler is going away? That would explain part of my confusion.

No, it's staying around. It's useful if you don't have tornado, are allergic to dependencies (it's stdlib only), or if you're doing relatively straightforward dask.array work. The newer scheduler is generally a more robust choice, though.

@GaelVaroquaux
Member Author

GaelVaroquaux commented Jan 26, 2018 via email

@mrocklin
Contributor

Is there a way for us to flatten the namespace on our end?

@GaelVaroquaux
Member Author

GaelVaroquaux commented Jan 26, 2018 via email

@mrocklin
Contributor

Ha!

@GaelVaroquaux
Member Author

GaelVaroquaux commented Jan 27, 2018 via email



with joblib.parallel_backend('dask.distributed', scheduler_host=address):
    joblib.Parallel(n_jobs=2, verbose=100)(
Contributor


In current master, there is no need to put n_jobs=2 here anymore.

for i in range(10))
# Check that joblib is indeed using the dask.distributed
# backend
print(joblib.Parallel(n_jobs=1)._backend)
Contributor

@ogrisel commented Feb 7, 2018


This is no longer necessary (on current master); I have added the active backend name to the verbose output of the call to Parallel.

@GaelVaroquaux
Member Author

I've addressed all issues, and CI is green.

Can I have a merge?

###############################################################################
# The verbose messages below show that the backend is indeed the
# dask.distributed one
with joblib.parallel_backend('dask.distributed', scheduler_host=address):
Contributor


Adding scheduler_host=address is no longer strictly necessary. Dask will use the most recently created Client by default.

Leaving it in is ok too.
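
Put together, the example could then be reduced to something like the following sketch (assuming a joblib and dask.distributed recent enough that the most recently created Client is picked up automatically, so no scheduler_host is needed):

import joblib
import distributed.joblib  # noqa: registers the 'dask.distributed' backend
from distributed import Client

client = Client()  # starts a LocalCluster on the current machine

def long_running_function(i):
    return i

with joblib.parallel_backend('dask.distributed'):
    results = joblib.Parallel(verbose=100)(
        joblib.delayed(long_running_function)(i) for i in range(10))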

@GaelVaroquaux force-pushed the distributed_example branch from e18d12a to d110ba5 on May 28, 2018 17:53
@GaelVaroquaux
Member Author

Merging this guy.

@GaelVaroquaux GaelVaroquaux merged commit 2a31976 into joblib:master May 28, 2018
yarikoptic added a commit to yarikoptic/joblib that referenced this pull request Jul 28, 2018
* tag '0.12': (116 commits)
  Release 0.12
  typo
  typo
  typo
  ENH add initializer limiting n_threads for C-libs (joblib#701)
  DOC better parallel docstring (joblib#704)
  [MRG] Nested parallel call thread bomb mitigation (joblib#700)
  MTN vendor loky2.1.3 (joblib#699)
  Make it possible to configure the reusable executor workers timeout (joblib#698)
  MAINT increase timeouts to make test more robust on travis
  DOC: use the .joblib extension instead of .pkl (joblib#697)
  [MRG] Fix exception handling in nested parallel calls (joblib#696)
  Fix skip test lz4 not installed (joblib#695)
  [MRG] numpy_pickle:  several enhancements (joblib#626)
  Introduce Parallel.__call__ backend callbacks (joblib#689)
  Add distributed on readthedocs (joblib#686)
  Support registration of external backends (joblib#655)
  [MRG] Add a dask.distributed example (joblib#613)
  ENH use cloudpickle to pickle interactively defined callable (joblib#677)
  CI freeze the version of sklearn0.19.1 and scipy1.0.1 (joblib#685)
  ...