Skip to content

Conversation

neilalexander
Copy link
Member

This allows promoting mirrors by removing the mirror configuration in a stream update.

Signed-off-by: Neil Twigg neil@nats.io

@neilalexander neilalexander force-pushed the neil/promotemirror branch 2 times, most recently from e2af3af to 4b40ee6 Compare August 14, 2025 12:01
@neilalexander neilalexander marked this pull request as ready for review August 14, 2025 12:13
@neilalexander neilalexander requested a review from a team as a code owner August 14, 2025 12:13
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like a "step" but worried about ops level semantics. I would like to promote and demote the other at close to the same time.

@@ -6633,7 +6633,8 @@ func (s *Server) jsClusteredStreamUpdateRequest(ci *ClientInfo, acc *Account, su
return
}
// Check for mirror changes which are not allowed.
if !reflect.DeepEqual(newCfg.Mirror, osa.Config.Mirror) {
// We will allow removing the mirror config to "promote" the mirror to a normal stream.
if newCfg.Mirror != nil && !reflect.DeepEqual(newCfg.Mirror, osa.Config.Mirror) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make sure the primary os down here? I would want the mirror to operate like the primary at this point I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we inherit the subjects from the old primary here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't assume we will be able to reach the primary stream - this is something that will likely be done during a outage, the primary is dead or so down that its unreachable etc and people now are doing recovery playbooks.

So I dont think we can assume its reachable, ditto about being caught up - thats assuming the primary is reachable.

After this the primary Is basically unusable because it cant again become a mirror of this one.

Lots of moving parts in making this right and I think thats all tooling based moving parts, lots of confirming user intentions, asking if its available or not doing checks where the tooling can maybe even fetching the subjects but handling cases where its not possible etc

Copy link
Member Author

@neilalexander neilalexander Aug 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think our consensus is that the server can't reliably judge what the right thing to do is, after all the mirror origin could be online in our domain, online but in a different domain, offline-ish/flaky, offline never to return. The best we can do is just surface the promotion as a primitive operation in the server and try our best to surface it in the tools in a way that makes sense to operators for different playbooks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the main use case for this is IMHO is DR scenarios (specifically 2 sites DR), where we don't have any solution right now.

IMHO this isn't meant for 'automated' multi-site fault tolerance: you must use a stretch cluster (or multiple clusters with cross sourcing) for that, but for disaster recovery where the administrators makes the decision to make what was the mirror at the DR site into the new primary, the stream being mirrored is going to be unreachable/down at that point. Let them control adding the subjects as part of that DR operation (the subjects could be a little bit different anyways depending on how the applications running at the DR site are configured). Just like it's going to be an administrative operation when they decide to do the reverse when the primary site is back up. They will indeed likely have playbooks for handling disasters that will involve many administrative operations, with NATS being just a small part of it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@derekcollison If you're happy with this consensus then we can merge it for experimenting in preview.2 this week.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add in same subjects from upstream source if we know them. Could be opt-in and fail if we can't reach it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be better to investigate possibilities like that in a separate PR, I suspect we'd have to issue a stream info to find those and a stream update doesn't feel like the right place to do that.

Signed-off-by: Neil Twigg <neil@nats.io>
Copy link
Member

@MauriceVanVeen MauriceVanVeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@neilalexander neilalexander merged commit f016a84 into main Aug 19, 2025
69 of 70 checks passed
@neilalexander neilalexander deleted the neil/promotemirror branch August 19, 2025 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants