Cluster singleton manager: don't send member events to FSM during shutdown #24236
…tdown There exists a race where a cluster node that is being downed sees itself as the oldest node (as it has had the other nodes removed) and takes over the singleton manager, sending the real oldest node into the End state, which means cluster singletons never work again. This fix simply prevents Member events from being given to the Cluster Manager FSM during shutdown, instead relying on SelfExiting. This also hardens the test by not downing the node that the current sharding coordinator is running on, and fixes a bug in the probes.
The test fails when the downed node has had the other members removed. This happens in roughly 1 in 10 runs locally.
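For reviewers who want the shape of the guard without reading the diff, here is a minimal, self-contained sketch of the idea described above. It is a toy model, not the actual Akka code: `OldestTracker`, `markSelfDowned`, `forwardedToFsm` and the string messages are hypothetical names, and the real change in the singleton manager may be structured differently.

```scala
// Hypothetical, simplified model of the guard; not the actual Akka implementation.
sealed trait ClusterEvent
final case class MemberRemoved(address: String) extends ClusterEvent
case object SelfExiting extends ClusterEvent

final class OldestTracker(self: String, initialMembers: List[String]) {
  // Assumption for this sketch: members are kept ordered oldest first.
  private var membersByAge: List[String] = initialMembers
  private var selfDowned: Boolean = false
  // Stand-in for messages that would be delivered to the manager FSM.
  private var forwarded: Vector[String] = Vector.empty

  def markSelfDowned(): Unit = selfDowned = true

  def forwardedToFsm: Vector[String] = forwarded

  def handle(event: ClusterEvent): Unit = event match {
    case SelfExiting =>
      // The exit signal is always forwarded, even during shutdown.
      forwarded = forwarded :+ "SelfExiting"
    case MemberRemoved(_) if selfDowned =>
      // During shutdown, member events are dropped so this node cannot
      // conclude that it has become the oldest and take over the singleton.
      ()
    case MemberRemoved(address) =>
      membersByAge = membersByAge.filterNot(_ == address)
      if (membersByAge.headOption.contains(self))
        forwarded = forwarded :+ "BecameOldest"
  }
}
```

In this model a downed node that sees every other member removed forwards nothing to the FSM, while a healthy node in the same situation still reports that it became the oldest.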
Going to run the multi-node jobs on the repeat job for this.
Test PASSed.
good catch
wonder if the original issue should be kept and this is another thing?
I wrote:
The real issue that should be fixed is that there seems to be a race between the CS and the ClusterSingleton observing OldestChanged and terminating coordinator singleton before the graceful sharding stop is done
Yes, as I don't think it'll fix that one ^; this just fixed one test issue + one actual bug. So let's keep #24113 open for changing the test back to shutting down a node that has the coordinator.
This one is causing a lot of failures; would someone else from @akka/akka-team mind reviewing this?
@@ -409,7 +410,7 @@ class Cluster(val system: ExtendedActorSystem) extends Extension {
    * Should not called by the user. The user can issue a LEAVE command which will tell the node
    * to go through graceful handoff process `LEAVE -> EXITING -> REMOVED -> SHUTDOWN`.
    */
-  private[cluster] def shutdown(): Unit = {
+  @InternalApi private[cluster] def shutdown(): Unit = {
👍
…tdown (akka#24236)
There exists a race where a cluster node that is being downed sees
itself as the oldest node (as it has had the other nodes removed) and
takes over the singleton manager, sending the real oldest node into
the End state, which means cluster singletons never work again.
This fix simply prevents Member events from being given to the Cluster
Manager FSM during shutdown, instead relying on SelfExiting.
This also hardens the test by not downing the node that the current
sharding coordinator is running on as well as fixing a bug in the
probes.
Refs #24113