Skip to content

Conversation

deepthidevaki
Copy link
Contributor

@deepthidevaki deepthidevaki commented Aug 5, 2025

Description

The BrokerTopologyManager used to keep the latest topology and update it whenever a new event is received via gossip. Since the ordering of the events from different nodes are not guaranteed, it is unclear which update is the latest. So we used the term to the find out which node is the latest leader, in-case we receive updates out-of-order. This caused issues after a purge operation because the term restarts at 0. Even after receiving latest updates from the all nodes, this state cannot be updated because the term is still lower than the previously known term.

To fix this, we now keep track of all member properties in BrokerTopologyManager. Instead of updating the topology, we calculate a new topology by aggregating the latest member properties (BrokerInfo) that it knows. We still use the term to find the latest leader to handle out-of-order updates. But eventually an update from leader term 0 after purge will be reflected in the topology when the updates from other nodes that they are not a leader is received.

This PR consists of:

  • A new immutable implementation of BrokerClusterState which is always created by aggregating latest member properties
    • It distinguishes the live cluster state which contains volatile states like partition role, health etc. and a expected cluster state which is the configured cluster size, partition count etc.
    • The configured cluster size is only updated via cluster configuration received from the ClusterConfigurationService. This is different from the previous behavior, where it was also updated from the BrokerInfo received via gossip.
  • Since the topology manager now have a hard requirement to receiving update from Cluster configuration service, we changed the creation of beans to ensure any consumers that use BrokerTopologyManager starts cluster configuration services and listens the updates.
  • Many tests are updated to remove the dependency to the old internal implementation class BrokerClusterStateImpl. All usages in the tests are replaced with a mock implementation.

In addition these changes uncovered another bug by DefaultAdvertisedAddressIT. The default host was set by the unified configuration which prevented automatic resolution of advertised address. The test was passing before because it was only checking for topology which will get updated anyway because the member properties in swim is recieved via the confiugred initial-contact point. But with the changes in this PR, the topology is not complete until the Cluster configuration service can gossip the correct cluster configuration. Without correct advertised address, the communication was broken between broker and gateway. To catch such error, the test is also extended to send commands to command-api which should fail if the advertised address is not set correctly.

Checklist

Related issues

closes #35650

@github-actions github-actions bot added the component/zeebe Related to the Zeebe component/team label Aug 5, 2025
@deepthidevaki deepthidevaki force-pushed the dd-35650-purge branch 3 times, most recently from 2df46c5 to e27dca8 Compare August 5, 2025 11:15
@github-actions github-actions bot added component/operate Related to the Operate component/team component/tasklist Related to the Tasklist component/team component/optimize Related to Optimize component/team labels Aug 11, 2025
Previously, it was not possible to overwrite a partition leader with lower
term. This is required to handle out-of-order membership events. With the purge
operation, a partition is re-bootstrapped with a clean state. As a result the new
leader will start with term 1.

To handle out-of-order events properly, instead of updating the existing in-memory
cluster state, we always built it from the set of received member properties. This
way we generate a new topology from the latest member properties of all brokers that
are known to this client/gateway. We still use the term for conflict resolution to
determine who is the latest leader. But since the old leader do not exist anymore,
the newly generate topology will have the correct leader after the partition
re-bootstrap.

This approach consumes extra memory to keep track of all memberProperties in
BrokerTopologyManager. In addition, it has to iterate over all properties to generate
the new topology. However, membership updates are not very often to optimize this path.
So this is an acceptable overhead.
@deepthidevaki deepthidevaki force-pushed the dd-35650-purge branch 2 times, most recently from 8743404 to 9779069 Compare August 14, 2025 09:21
@deepthidevaki
Copy link
Contributor Author

TBD if this needs to be backported.

@deepthidevaki deepthidevaki marked this pull request as ready for review August 14, 2025 13:51
@deepthidevaki deepthidevaki requested a review from a team as a code owner August 14, 2025 13:51
@deepthidevaki deepthidevaki requested review from lenaschoenburg and panagiotisgts and removed request for a team and lenaschoenburg August 14, 2025 13:51
Copy link
Contributor

@panagiotisgts panagiotisgts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments and questions

Copy link
Contributor

@panagiotisgts panagiotisgts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall 🚀 Just some clarifications

…pt broker

BrokerTopologyManager always needs the ClusterConfigurationService to get the
correct cluster topology, because the static information is only updated via
ClusterConfigurationService.
Multiple members with same member-id "migration" were running, which causes
issues in gossiping latest topology info.
@github-actions github-actions bot added the component/c8-api All things unified C8 API, e.g. C8 REST label Aug 19, 2025
@deepthidevaki deepthidevaki added this pull request to the merge queue Aug 19, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 19, 2025
@deepthidevaki deepthidevaki added this pull request to the merge queue Aug 19, 2025
Merged via the queue into main with commit ea115aa Aug 19, 2025
165 of 169 checks passed
@deepthidevaki deepthidevaki deleted the dd-35650-purge branch August 19, 2025 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/c8-api All things unified C8 API, e.g. C8 REST component/operate Related to the Operate component/team component/optimize Related to Optimize component/team component/tasklist Related to the Tasklist component/team component/zeebe Related to the Zeebe component/team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Partitions are not bootstrapped after purge operation
2 participants