-
Notifications
You must be signed in to change notification settings - Fork 252
feat: add early stopping to Pregel #550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add early stopping to Pregel #550
Conversation
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #550 +/- ##
==========================================
- Coverage 91.43% 89.60% -1.83%
==========================================
Files 18 18
Lines 829 914 +85
Branches 52 110 +58
==========================================
+ Hits 758 819 +61
- Misses 71 95 +24 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@SauronShepherd could you please take a look? It is exactly as we discussed in email. I also created #553 and I'm going to write a deep dive into how to use Pregel, so I will describe |
Some Thoughts:
Other Observations:
More Thoughts: I think your "early stopping" feature is a great addition, particularly for cases where users don't know or can't estimate the required number of iterations, to prevent having many more iterations than necessary. However, it may introduce some overhead, so when All that being said, I don’t believe any of my comments prevent this PR from being approved. We can merge it now and revisit it in the future. |
Thanks for the review!
Nice catch, thanks! Fixed!
It was added with a SparkConnect PR. It is not related anyhow to the current PR and it is marked as a change because I wrap all the
Fixed (actually it was before this PR)!
Fixed!
Let's leave it for the further PRs? Otherwise it is hard to track performance improvements...
I don't think so, but let's leave it for the next PR?
I added it to both scala docstring and python docstring: /**
* Should Pregel stop earlier in case of no new messages to send?
*
* Early stopping allows to terminate Pregel before reaching maxIter by checking is there any
* non-null message or not. While in some cases it may gain significant performance boost, it
* other cases it can tend to performance degradation, because checking is messages DataFrame is
* empty or not is an action and requires materialization of the Spark Plan with some additional
* computations.
*
* In the case when user can assume a good value of maxIter it is recommended to leave this
* value to the default "false". In the case when it is hard to estimate an amount of iterations
* required for convergence, it is recommended to set this value to "false" to avoid iterating
* over convergence until reaching maxIter. When this value is "true", maxIter can be set to a
* bigger value without risks.
*
* @param value
* should Pregel checks for the termination condition on each step
* @return
*/ I have also plans to write a comprehensive guide about Pregel.
It may fail in the case when there were some transformations in |
I'll have to add some minor changes in the Pregel class because of #554, but I'll keep the |
@SauronShepherd Let's merge this one to unblock your work on #554 and avoid merge conflicts? |
I don't see any other option but commenting on this issue. |
@rjurney Hi! Could you take a look and if everything is good, could you please merge it? |
@rjurney what do you think about it? |
@SauronShepherd I cannot get an answer from @rjurney, could you please review it? I need it to finalize the new proposed version of the "no-graphx" ShortestPaths. |
@SemyonSinchenko sorry about that, I opened it to review last night. On it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
What changes were proposed in this pull request?
I added a
breakable
->break
to thePregel
. In the case of convergence before reachingmaxIter
it will give a significant performance benefit. In the case of not convergence before reachingmaxIter
it may introduce a small overhead, but becauseDataset.isEmpty
is not an action but is a quite cheap operation, the performance degradation is very slow. Another strong benefit to merge it is that allows users to set a very high value ofmaxIter
if they are sure that their Pregel should converge, so it simplify development and usage.Why are the changes needed?
As described in #549 / #368
Close #549
Close #368