Skip to content

Conversation

ktoso
Copy link
Contributor

@ktoso ktoso commented Sep 8, 2014

FOR DISCUSSION, see discussion in related topic: #15042

In general the question is

  1. if we want to make cluster routers take an actor system / cluster or not. There's also a broader question, as this issue is somewhat related to Cluster Aware Routers: nr-of-instances parameter usage #13802 (nr-of-instances being confusing), which may point to that cluster routers should become more specialised?

  2. The issue itself can be "solved" easily by returning 0 from the nrOfInstances, but that makes it possible to deadLetter some initial messages sent to the router if normally (when using allowLocal), that would not be the case. This safety though is a bit accidental, and maybe that's fine for fixing this issue?

If we don't want to break APIs, this could still be solved, but by setting the actor system "later" from "the inside", but's going to endup pretty ugly I think...

May eventually resolve #15042

@ktoso ktoso changed the title Discussion #15042 Discussion #15042 - useRole ignored on local, because routers unaware of roles locally Sep 8, 2014
@akka-ci akka-ci added validating PR is currently being validated by Jenkins tested PR that was successfully built and tested by Jenkins and removed validating PR is currently being validated by Jenkins labels Sep 8, 2014
@akka-ci
Copy link

akka-ci commented Sep 8, 2014

Pull request validation: SUCCESS 👍
Refer to this link for build results (access rights to CI server needed):
https://jenkins.akka.io/job/pr-validator-per-commit-jenkins/652/

@rkuhn
Copy link
Contributor

rkuhn commented Sep 8, 2014

I might be missing something but why does the ClusterRouterGroup class need to make the decision? ClusterRouterSettingsBase already contains the useRole information so the Pool handling can generically just not create routees if that does not match.

@ktoso
Copy link
Contributor Author

ktoso commented Sep 8, 2014

By the time we come to the Pool spinning up actors it's too-late* as in "changes behaviour slightly". Leaving it completely up to the Pool is equal to my 2) up there. This introduces a small change in behaviour though:

Now:
scenario: start routees, max 1 on each node, allow local, use role "x", local node does not have role "x"
Router starts, RoutedActorCell executes:

      case pool: Pool 
        if (pool.nrOfInstances > 0)    // the impl of this is not aware of cluster roles, returns maxNr!
          addRoutees(Vector.fill(pool.nrOfInstances)(pool.newRoutee(routeeProps, this)))

which starts nrOfInstaces routees on local even the local node does not match the useRole. This is the bug under discussion.

So the simple fix (2)) is to always return 0 in that nrOfInstaces method, of leaving starting routees 100% up to the Pool. RoutedActorCell will not start local routees "out of the blue". They will be started in reaction to MemberUp(theLocalNode). This code knows about the cluster and roles, the bug is solved. This introduces a delay, which was previously not there. Messages sent to the router, before it gets the MemberUp and starts the local routees will go to deadLetters, where as previously they would have landed in the "pre started local actors".

In summary and when thinking about it...
Maybe people should not expect the local routees to behave any differently than the remote routees.
This means instead of current state of this PR, we hardcode returning 0 in cluster pool router and the above scenario applies.

@rkuhn
Copy link
Contributor

rkuhn commented Sep 8, 2014

What I mean is that instead of returning 0 from nrOfInstances the calling code could just come to the same conclusion by inspecting the settings. And that code very likely does have an ActorSystem at its disposal.

@rkuhn
Copy link
Contributor

rkuhn commented Sep 8, 2014

But I’m fine with your conclusion as well; I don’t think any expectations are violated by not creating local routees when they are not called for ;-) And if it is easy to create the local routees eagerly nevertheless then we might do it just to be nice, but that is not required.

@patriknw
Copy link
Contributor

patriknw commented Sep 9, 2014

@rkuhn I think you are right that the calling code could figure it out, but the problem is that right now the calling code is from the general router code (not aware of such things). Not impossible to change though.

However, we should not rush this, we should think one more time about these kind of things related to the cluster aware routers (and I have swapped focus away this right now).

@rkuhn
Copy link
Contributor

rkuhn commented Sep 9, 2014

agreed

@ktoso
Copy link
Contributor Author

ktoso commented Sep 9, 2014

Totally agreed, don't want to rush this one, as there seems to be confusion around cluster-routers anyway (the #13802).

I'll try to dig up some time to hack on this more soon.

@akka-ci akka-ci added validating PR is currently being validated by Jenkins tested PR that was successfully built and tested by Jenkins and removed tested PR that was successfully built and tested by Jenkins validating PR is currently being validated by Jenkins labels Sep 15, 2014
@akka-ci
Copy link

akka-ci commented Sep 15, 2014

Pull request validation: SUCCESS 👍
Refer to this link for build results (access rights to CI server needed):
https://jenkins.akka.io/job/pr-validator-per-commit-jenkins/744/

@patriknw
Copy link
Contributor

@ktoso I think that sounds good. Overall I think it would be good if the cluster aware pool could create the local routee early if allowLocal=true and the role is ok, because then you can start using the ActorRef immediately without loosing messages.

However, I think we should check the real cluster membership status and only do this if the node is Up. Currently that is not the case, it happily creates the routee assuming that the local node will become Up soon.
Note that you can grab this info with something like:

val selfAddress = Cluster(system).selfAddress
val isUp = Cluster(system).state.members.exists(m => m.address == selfAddress && m.status == MemberStatus.Up) 

@ktoso ktoso force-pushed the clu-useRole-ignored-local-ktoso branch from 8fe31b9 to e870cbd Compare September 15, 2014 14:27
@ktoso
Copy link
Contributor Author

ktoso commented Sep 15, 2014

@patriknw good hint about starting only when UP, incorporated this into the method ( ktoso@e870cbd#diff-5a96a7bdef4c972bff8f8f825b6640acR162 impl of it stays pretty much the same, if we go with 2nd proposal or 3rd proposal ).

I pushed another idea, which would boil down to having "SimplePool" and "ExtendedPool", with the idea that extended pool "can do whatever it needs", currently that's only during start() but it could be extended to do more... Now the question is, adding another interface here makes things a) (bad) inconsistent, we need to handle both Pool types everywhere b) (good) we could make the APIs a bit more specific...

So far I think the simplest would be the 2nd proposal, that is adding ActorSystem to nrOfInstances ( ktoso@9ea9f12 ), as 3rd seems to add too much stuff for sake of "maybe it will be used" hm.

Would you guys mind having a look again, so we can sattle on 2 or 3 and then I'll squash and do a migration guide?

@akka-ci akka-ci added validating PR is currently being validated by Jenkins and removed tested PR that was successfully built and tested by Jenkins labels Sep 15, 2014
@patriknw
Copy link
Contributor

I like the 2nd proposal.
Great that you explore different alternatives.

@akka-ci akka-ci added needs-attention Indicates a PR validation failure (set by CI infrastructure) and removed validating PR is currently being validated by Jenkins labels Sep 15, 2014
@akka-ci
Copy link

akka-ci commented Sep 15, 2014

Pull request validation: FAILED 👎
Refer to this link for build results (access rights to CI server needed):
https://jenkins.akka.io/job/pr-validator-per-commit-jenkins/748/

@ktoso
Copy link
Contributor Author

ktoso commented Sep 16, 2014

/note Didn't get enough time to revisit and clean it up properly today. Will re-visit it once back I guess.

@ktoso ktoso force-pushed the clu-useRole-ignored-local-ktoso branch from e870cbd to 62bd7f0 Compare October 31, 2014 14:34
@ktoso
Copy link
Contributor Author

ktoso commented Oct 31, 2014

Went with the above discussed 2nd approach (as in nrOfInstances(system)) and squashed everything.
Includes migration guide notice.

@ktoso ktoso changed the title Discussion #15042 - useRole ignored on local, because routers unaware of roles locally useRole ignored on local, because routers unaware of roles locally Oct 31, 2014
@ktoso ktoso force-pushed the clu-useRole-ignored-local-ktoso branch 3 times, most recently from 25109a4 to 82e179c Compare October 31, 2014 14:59
@ktoso
Copy link
Contributor Author

ktoso commented Oct 31, 2014

PLS BUILD

@akka-ci akka-ci added validating PR is currently being validated by Jenkins tested PR that was successfully built and tested by Jenkins and removed needs-attention Indicates a PR validation failure (set by CI infrastructure) validating PR is currently being validated by Jenkins labels Oct 31, 2014
@akka-ci
Copy link

akka-ci commented Oct 31, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://jenkins.akka.io/job/pr-validator-per-commit-jenkins/1078/

@akka-ci akka-ci added needs-attention Indicates a PR validation failure (set by CI infrastructure) and removed tested PR that was successfully built and tested by Jenkins labels Oct 31, 2014
@akka-ci
Copy link

akka-ci commented Oct 31, 2014

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://jenkins.akka.io/job/pr-validator-per-commit-jenkins/1079/


In order to make cluster routers smarter about when they can start local routees,
``nrOfInstances`` defined on ``Pool`` now takes ``ActorSystem`` as an argument.
You will in case you have implemented a custom Pool you will have to update the method's signature,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case you ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks :)

@patriknw
Copy link
Contributor

LGTM!!!!!!!

@akka-ci akka-ci added tested PR that was successfully built and tested by Jenkins and removed needs-attention Indicates a PR validation failure (set by CI infrastructure) labels Oct 31, 2014
@akka-ci
Copy link

akka-ci commented Oct 31, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://jenkins.akka.io/job/pr-validator-per-commit-jenkins/1081/

This is an API breaking change if someone implemented their own Routers.
The change is required because the router must know if the local routees
should be started or not so it has to check the roles of the cluster
member (the local one). We could delay this decision of starting local
routees, but that would allow messages to be dead-letter-ed (bad).
@ktoso ktoso force-pushed the clu-useRole-ignored-local-ktoso branch from 82e179c to 3f12ef2 Compare October 31, 2014 21:33
@ktoso
Copy link
Contributor Author

ktoso commented Oct 31, 2014

Fixed typo in migration guide and... :shipit:! :-)

ktoso added a commit that referenced this pull request Oct 31, 2014
useRole ignored on local, because routers unaware of roles locally
@ktoso ktoso merged commit c2983c7 into akka:master Oct 31, 2014
@ktoso ktoso deleted the clu-useRole-ignored-local-ktoso branch October 31, 2014 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tested PR that was successfully built and tested by Jenkins
Projects
None yet
Development

Successfully merging this pull request may close these issues.

useRole ignored for local local routee
4 participants