Description
As seen here, the `DistributedPubSubMediator` removes nodes (and hence potential recipients for messages) from its registry when the `MemberRemoved` event occurs. This might be too late, since a gracefully leaving node commonly terminates as soon as it sees its own `MemberRemoved` event. That can happen before the removal has converged across the whole cluster, so other nodes might still send messages to the gracefully leaving node.
See the following timeline:
| Step | Node | Event |
|---|---|---|
| 1 | A | leave |
| 2 | all | gossip |
| 3 | all | converge |
| 4 | Leader | set A to exiting |
| 5 | all | gossip |
| 6 | all | converge |
| 7 | Leader | set A to removed |
| 8 | all | gossip |
| 9 | A | see A removed |
| 10 | A | terminate |
| 11 | B | send message via PubSub |
| 12 | B | message from PubSub is lost because A is terminated |
| 13 | B | see A removed |
| 14 | B | remove A from PubSub |
| 15 | all | converge |
If the `DistributedPubSubMediator` removed nodes from its registry on `exiting` instead, gracefully leaving nodes would be guaranteed to be removed from every PubSub registry in the cluster before they receive their own `removed` event and terminate, because the leader sets a node to `removed` only after the `exiting` state has converged. No messages would then be lost to the race between termination and PubSub unregistration.
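The difference between the two removal points can be illustrated with a small model of node B's registry as it replays the timeline. This is only a sketch in plain Java, not Akka code: `PubSubRaceSketch`, `Mediator`, `RemovalTrigger` and `lostMessages` are hypothetical names made up for this illustration, and the model assumes that B has already seen A as exiting by the time A terminates, as the timeline implies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the race; none of these types exist in Akka. */
final class PubSubRaceSketch {

    /** The membership event on which a mediator drops a node from its registry. */
    enum RemovalTrigger { ON_EXITING, ON_REMOVED }

    /** Simplified stand-in for the mediator registry on a single node. */
    static final class Mediator {
        private final Set<String> registry = new HashSet<>();
        private final RemovalTrigger trigger;

        Mediator(RemovalTrigger trigger, String... nodes) {
            this.trigger = trigger;
            this.registry.addAll(Arrays.asList(nodes));
        }

        /** Drops the node early only when the proposed behavior is selected. */
        void onMemberExited(String node) {
            if (trigger == RemovalTrigger.ON_EXITING) {
                registry.remove(node);
            }
        }

        /** Removal always drops the node, under either policy. */
        void onMemberRemoved(String node) {
            registry.remove(node);
        }

        boolean wouldSendTo(String node) {
            return registry.contains(node);
        }
    }

    /** Replays the timeline from B's point of view and counts messages sent to the terminated A. */
    static int lostMessages(RemovalTrigger trigger) {
        Mediator mediatorOnB = new Mediator(trigger, "A", "B");
        int lost = 0;

        // Steps 4-6: the leader sets A to exiting and the cluster converges, so B has seen it.
        mediatorOnB.onMemberExited("A");

        // Steps 7-10: A sees its own removal before B does and terminates.
        boolean aTerminated = true;

        // Steps 11-12: B publishes while it has not yet seen A removed.
        if (mediatorOnB.wouldSendTo("A") && aTerminated) {
            lost++;
        }

        // Steps 13-14: B finally sees A removed.
        mediatorOnB.onMemberRemoved("A");
        return lost;
    }
}
```

In this model, `PubSubRaceSketch.lostMessages(RemovalTrigger.ON_REMOVED)` returns 1 and `PubSubRaceSketch.lostMessages(RemovalTrigger.ON_EXITING)` returns 0. The model deliberately ignores everything else the real mediator does (topic state, gossip of registry contents, delivery semantics), so it only shows the ordering argument, not a patch.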
/cc @ktoso