Skip to content

Conversation

taylanisikdemir
Copy link
Member

@taylanisikdemir taylanisikdemir commented Jul 16, 2025

What changed?

  • Updated replication simulation to support active-passive to active-active domain update operation
  • Updated domain handler to support this case
    • Carry over domain's failover version to the corresponding entry in ActiveClusterbyRegion map
    • Increment top level failover version to indicate a cluster change/failover event occurred. Domain update propagation needs this.
  • Update failover history entries to record 3 types of events:
    • active-passive domain failover
    • active-active domain failover
    • active-passive to active-active migration

Why?
Avoid requiring creating new domains to leverage active-active mode by supporting migration

How did you test it?

  • Unit tests for domain handler
  • active-passive to active-active simulation: Validates domain is functioning as active-active afterwards
  • active-active regional failover: Validates that we can do regional failover for active-active domains. e.g. region0-> region1. Workflows previously active on the cluster in region0 are resumed on the cluster in region1

@taylanisikdemir taylanisikdemir changed the title WIP: Active-passive to active-active domain migration support Active-passive to active-active domain migration support Jul 17, 2025
@taylanisikdemir taylanisikdemir marked this pull request as ready for review July 17, 2025 18:20
// we increment failover version so top level failoverVersion is updated and domain data is replicated.
failoverVersion = d.clusterMetadata.GetNextFailoverVersion(
replicationConfig.ActiveClusterName,
failoverVersion+1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why + 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetNextFailoverVersion(0) returns 0. I need it to actually increase.

When an active-passive domain is migrated to active-active,
- The domain record will have the `ActiveClusters` field set
- The existing `ActiveClusterName` field will be left as is.
- The failover version of the domain will be incremented. Even though this is not used for task versions for active-active domains, it's incremented to indicate there was a change in replication config.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kinda afraid that this can cause confusion to people. For active-active domain and active-passive domains, this field will have different semantics. Should we deprecate this field for active-active domain and use a different field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are gonna hide it in UI & CLI response so shouldn't cause confusion unless DB is inspected. However it's not a bad idea to not use this field for active-active domains at all. That requires further changes in domain replication such as this condition. I'll leave that as TODO(active-active) and we can evaluate later

// top level failover version is not used for task versions for active-active domains but we still increment it
// to indicate there was a change in replication config
failoverVersion = d.clusterMetadata.GetNextFailoverVersion(
d.clusterMetadata.GetCurrentClusterName(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use current cluster here and active cluster above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Active-active domain doesn't have a current cluster (unless it's migrated) but we still need to increase the failover version due to the reason I mentioned in other comment.

@taylanisikdemir taylanisikdemir merged commit 11c594a into cadence-workflow:master Jul 18, 2025
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants