Skip to content

Conversation

chbatey
Copy link
Contributor

@chbatey chbatey commented Jan 3, 2018

There exists a race where a cluster node that is being downed sees its
self as the oldest node (as it has had the other nodes removed) and it
takes over the singleton manager sending the real oldest node to go into
the End state meaning that cluster singletons never work again.

This fix simply prevents Member events being given to the Cluster
Manager FSM during a shut down, instread relying on SelfExiting.

This also hardens the test by not downing the node that the current
sharding coordinator is running on as well as fixing a bug in the
probes.

Refs #24113

…tdown

There exists a race where a cluter node that is being downed seens its
self as the oldest node (as it has had the other nodes removed) and it
takes over the singleton manager sending the real oldest node to go into
the End state meaning that cluster singletons never work again.

This fix simply prevents Member events being given to the Cluster
Manager FSM during a shut down, instread relying on SelfExiting.

This also hardens the test by not downing the node that the current
sharding coordinator is running on as well as fixing a bug in the
probes.
@chbatey
Copy link
Contributor Author

chbatey commented Jan 3, 2018

The test fails when the downed node has the other remembers removed. This happens locally 1/10ish.

@akka-ci akka-ci added the validating PR is currently being validated by Jenkins label Jan 3, 2018
@chbatey
Copy link
Contributor Author

chbatey commented Jan 3, 2018

Gong to run the multi node jobs on repeat job for this

@akka-ci akka-ci added tested PR that was successfully built and tested by Jenkins and removed validating PR is currently being validated by Jenkins labels Jan 3, 2018
@akka-ci
Copy link

akka-ci commented Jan 3, 2018

Test PASSed.

Copy link
Contributor

@patriknw patriknw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

wonder if the original issue should be kept and this is another thing?
I wrote:

The real issue that should be fixed is that there seems to be a race between the CS and the ClusterSingleton observing OldestChanged and terminating coordinator singleton before the graceful sharding stop is done

@chbatey
Copy link
Contributor Author

chbatey commented Jan 4, 2018

Yes as i don't think it'll fix that one ^ this just fixed one test issue + one actual bug. So lets keep #24113 open for changing the test back to shutting down a node that has the coordinator

@chbatey
Copy link
Contributor Author

chbatey commented Jan 5, 2018

This one causing a lot of failures, someone else from @akka/akka-team mind reviewing this?

@@ -409,7 +410,7 @@ class Cluster(val system: ExtendedActorSystem) extends Extension {
* Should not called by the user. The user can issue a LEAVE command which will tell the node
* to go through graceful handoff process `LEAVE -> EXITING -> REMOVED -> SHUTDOWN`.
*/
private[cluster] def shutdown(): Unit = {
@InternalApi private[cluster] def shutdown(): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@raboof raboof merged commit 0380cc5 into akka:master Jan 5, 2018
manonthegithub pushed a commit to manonthegithub/akka that referenced this pull request Jan 31, 2018
…tdown (akka#24236)

There exists a race where a cluter node that is being downed seens its
self as the oldest node (as it has had the other nodes removed) and it
takes over the singleton manager sending the real oldest node to go into
the End state meaning that cluster singletons never work again.

This fix simply prevents Member events being given to the Cluster
Manager FSM during a shut down, instread relying on SelfExiting.

This also hardens the test by not downing the node that the current
sharding coordinator is running on as well as fixing a bug in the
probes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tested PR that was successfully built and tested by Jenkins
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants