
DistributedPubSubMediator should remove nodes on leaving and not on removed #20826

@choffmeister

Description

As seen here, the DistributedPubSubMediator removes nodes (and hence potential message recipients) from its registry when the MemberRemoved event occurs. This might be too late: a gracefully leaving node commonly terminates as soon as it sees its own MemberRemoved event, and that can happen before the removal has converged across the whole cluster. Other nodes may therefore still send messages to the leaving node after it has terminated.

See the following timeline:

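| Step | Leader | A | B |
|---|---|---|---|
| 1 | | leave | |
| 2 | gossip | | |
| 3 | converge | | |
| 4 | set A to exiting | | |
| 5 | gossip | | |
| 6 | converge | | |
| 7 | set A to removed | | |
| 8 | gossip | | |
| 9 | | see A removed | |
| 10 | | terminate | |
| 11 | | | send message via PubSub |
| 12 | | | message from PubSub is lost because A is terminated |
| 13 | | | see A removed |
| 14 | | | remove A from PubSub |
| 15 | converge | | |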
15 converge
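To make the race concrete, here is a minimal Python sketch that replays the timeline above. It is a simplified, hypothetical model, not Akka code: each node keeps a local set standing in for its PubSub registry, the gossip and converge steps are omitted because they do not touch the registries, and a node is pruned only when the local node sees it removed, mirroring the current behaviour described in this issue.

```python
# Hypothetical replay of the timeline above (simplified model, not Akka code).
# Registries are pruned only when the local node sees a member as removed,
# which mirrors the current DistributedPubSubMediator behaviour described here.

def replay_timeline():
    alive = {"A": True, "B": True}
    # Each node's local stand-in for its PubSub registry:
    # the nodes it would route topic messages to.
    registry = {"A": {"A", "B"}, "B": {"A", "B"}}
    lost = []

    # (step, actor, action); gossip/converge steps are left out because
    # they never change a registry in this model.
    steps = [
        (1, "A", "leave"),
        (4, "Leader", "set A to exiting"),   # current rule: no registry change
        (7, "Leader", "set A to removed"),
        (9, "A", "see A removed"),
        (10, "A", "terminate"),
        (11, "B", "send message via PubSub"),
        (13, "B", "see A removed"),          # steps 13-14: B prunes A only now
    ]
    for step, actor, action in steps:
        if action == "see A removed":
            registry[actor].discard("A")     # prune on MemberRemoved
        elif action == "terminate":
            alive[actor] = False
        elif action == "send message via PubSub":
            for target in sorted(registry[actor]):
                if not alive[target]:
                    lost.append((step, actor, target))
    return lost, registry


lost, registry = replay_timeline()
print(lost)  # [(11, 'B', 'A')]
```

Only B's message to A is lost: B still routes to A at step 11 because it does not learn about the removal until step 13, while A already terminated at step 10.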

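If the DistributedPubSubMediator removed nodes from its registry when they become Exiting, it would be guaranteed that a gracefully leaving node is removed from every PubSub registry in the cluster before it gets its own MemberRemoved event and terminates, because the leader only moves a member from Exiting to Removed after the cluster has converged on the Exiting state (steps 5 to 7 above). This way, no messages would be lost because of the race between termination and PubSub unregistration.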
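The ordering argument can be checked with a rough randomized model, again a hypothetical Python sketch rather than Akka's implementation. It assumes, as in the timeline, that the leader marks A as Removed only after every other node has seen A as Exiting, that A terminates at or after that decision, and that other nodes may observe the removal arbitrarily later. The delays are made-up uniform values; only their ordering matters.

```python
import random


def race_occurs(prune_on: str, observers: int, rng: random.Random) -> bool:
    """Return True if some observer still routes to A after A has terminated.

    Simplified timing model (assumption, not Akka internals):
    * each observer sees A become Exiting at some time;
    * the leader marks A Removed only after all observers saw Exiting (convergence);
    * A terminates at or shortly after the leader's Removed decision;
    * observers may see Removed arbitrarily later.
    """
    saw_exiting = [rng.uniform(0, 10) for _ in range(observers)]
    removed_at = max(saw_exiting) + rng.uniform(0, 1)
    saw_removed = [removed_at + rng.uniform(0, 10) for _ in range(observers)]
    a_terminates = removed_at + rng.uniform(0, 1)

    # When each observer drops A from its registry under the chosen policy.
    prune_times = saw_exiting if prune_on == "Exiting" else saw_removed
    # A race means an observer still routed to A after A was gone.
    return any(t > a_terminates for t in prune_times)


rng = random.Random(42)
for policy in ("Removed", "Exiting"):
    races = sum(race_occurs(policy, observers=3, rng=rng) for _ in range(1000))
    print(policy, races)  # the Exiting policy always reports 0 in this model
```

Under these assumptions every Exiting observation precedes the Removed decision, so pruning on Exiting never leaves a stale route, while pruning on Removed races whenever some node learns of the removal after A has already terminated.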

/cc @ktoso

Metadata

    Labels

    1 - triaged (Tickets that are safe to pick up for contributing in terms of likeliness of being accepted), t:cluster, t:cluster-tools