Make input task generator aware of intermediate results #582
Conversation
Can you explain what the use case for this change is? In particular, how is it related to #217?

Side comment: the AppVeyor failure is not related to this PR (numpy str/repr formatting change in numpy 1.14).
My particular use case: I want to parallelize tasks over several jobs, but some of those tasks can only be fully defined once the results of the previous tasks are known. So it needs a queue: each time a result is returned, it can trigger the addition of a new task to the queue.

I want to apply this to Bayesian hyper-parameter search, where the state of the Bayesian optimizer is updated after each batch of results, before yielding a new batch of points to test. By nature this looks like an asynchronous problem. I have in mind to build a POC that relies on this patch for BayesSearchCV from scikit-optimize. The current implementation uses

I'm not sure there would be any significant performance improvement (because usually the bottleneck is largely the time required to process each task), but in my opinion it would be more elegant and possibly give a more sensible API. There could be other use cases for the queue mechanism (RL?). More generally, this enables implementing callbacks on intermediate results (logging, maybe early-stopping...). It might feel hacky for now, but I find it fairly flexible: it does not change joblib much and keeps all the nice features.

About #217: the only relation is that a generator also lets the user access intermediate results before all the tasks are completed (on a side note, that implementation looks more difficult, and it wouldn't offer the possibility for callbacks).

Some more free thoughts about why I like this: the current
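To make the queue idea concrete, here is a small standalone sketch (it does not use joblib or this patch; `run_adaptive`, `work`, and `propose_next` are hypothetical names): each finished result may define the next task, the way a Bayesian optimizer suggests a new point after seeing a new observation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_adaptive(initial_tasks, work, propose_next, budget):
    """Run `work` on tasks; each finished result may trigger a new task.

    `propose_next(results)` returns the next task given the results so
    far, mimicking an optimizer that proposes a new point per result.
    """
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = {pool.submit(work, t) for t in initial_tasks}
        while pending:
            # Wait for any one task to finish. A fresh iterator is
            # created each loop, so tasks added below are seen later.
            fut = next(as_completed(pending))
            pending.discard(fut)
            results.append(fut.result())
            if len(results) + len(pending) < budget:
                pending.add(pool.submit(work, propose_next(results)))
    return results

# toy run: start from two points, grow the queue up to 5 tasks total
out = run_adaptive([1, 2], work=lambda x: x * x,
                   propose_next=lambda res: max(res) + 1, budget=5)
```

The key property is that submission is driven by completions rather than fixed up front, which is exactly what a plain `Parallel(...)(generator)` call cannot express today.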
Codecov Report
@@ Coverage Diff @@
## master #582 +/- ##
===========================================
- Coverage 94.23% 78.18% -16.05%
===========================================
Files 39 38 -1
Lines 5011 5001 -10
===========================================
- Hits 4722 3910 -812
- Misses 289 1091 +802
Continue to review full report at Codecov.
... the codecov report seems broken? (The last commit added the test/example for all parallel backends.)
These few changes to `parallel.py` are intended to enable reading intermediate outputs during the call to `Parallel` (related to #79 and #217). The callback passed to `apply_async` now updates an attribute `last_async_output` on the running `Parallel` instance. It stores the result that triggered the callback. This value can then be read and processed by the input generator, e.g. for logging/printing purposes or to generate new tasks (see the example in `test_parallel.py`).

It seems to work without unexpected behavior for the same reason that line 217, `self.parallel.n_completed_tasks += self.batch_size`, is safe: multiprocessing (and loky?) only use one thread to sequentially execute the callbacks. Hence, if the input generator reads `Parallel.last_async_output`, the value is guaranteed to be the result that triggered the callback that had itself triggered one more read from the generator.

(edit: I also tried to make the generator directly read `parallel._output`, but it can happen that the result that triggered the callback has not been added to `parallel._output` yet.)
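The single-callback-thread claim can be checked empirically with a small standalone script. This uses `multiprocessing.pool.ThreadPool` rather than the patched joblib, so it is only an analogy for the backends mentioned above: all `apply_async` callbacks are run sequentially by the pool's one result-handler thread.

```python
import threading
from multiprocessing.pool import ThreadPool

callback_threads = set()
order = []

def on_result(res):
    # Record which thread executes the callback, and the results seen.
    callback_threads.add(threading.get_ident())
    order.append(res)

with ThreadPool(processes=4) as pool:
    for i in range(50):
        pool.apply_async(lambda x: x, (i,), callback=on_result)
    pool.close()
    pool.join()

# Every callback ran on the same (single) result-handler thread, so
# reads of shared state like `last_async_output` are not racy there.
assert len(callback_threads) == 1
```

Because only that one thread mutates the shared attribute, a generator driven from inside a callback observes a consistent value, which is the safety argument made above.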