-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
(2.12) Allow promoting mirrors #7171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e2af3af
to
4b40ee6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like a "step" but worried about ops level semantics. I would like to promote and demote the other at close to the same time.
@@ -6633,7 +6633,8 @@ func (s *Server) jsClusteredStreamUpdateRequest(ci *ClientInfo, acc *Account, su | |||
return | |||
} | |||
// Check for mirror changes which are not allowed. | |||
if !reflect.DeepEqual(newCfg.Mirror, osa.Config.Mirror) { | |||
// We will allow removing the mirror config to "promote" the mirror to a normal stream. | |||
if newCfg.Mirror != nil && !reflect.DeepEqual(newCfg.Mirror, osa.Config.Mirror) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make sure the primary os down here? I would want the mirror to operate like the primary at this point I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we inherit the subjects from the old primary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't assume we will be able to reach the primary stream - this is something that will likely be done during a outage, the primary is dead or so down that its unreachable etc and people now are doing recovery playbooks.
So I dont think we can assume its reachable, ditto about being caught up - thats assuming the primary is reachable.
After this the primary Is basically unusable because it cant again become a mirror of this one.
Lots of moving parts in making this right and I think thats all tooling based moving parts, lots of confirming user intentions, asking if its available or not doing checks where the tooling can maybe even fetching the subjects but handling cases where its not possible etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I think our consensus is that the server can't reliably judge what the right thing to do is, after all the mirror origin could be online in our domain, online but in a different domain, offline-ish/flaky, offline never to return. The best we can do is just surface the promotion as a primitive operation in the server and try our best to surface it in the tools in a way that makes sense to operators for different playbooks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the main use case for this is IMHO is DR scenarios (specifically 2 sites DR), where we don't have any solution right now.
IMHO this isn't meant for 'automated' multi-site fault tolerance: you must use a stretch cluster (or multiple clusters with cross sourcing) for that, but for disaster recovery where the administrators makes the decision to make what was the mirror at the DR site into the new primary, the stream being mirrored is going to be unreachable/down at that point. Let them control adding the subjects as part of that DR operation (the subjects could be a little bit different anyways depending on how the applications running at the DR site are configured). Just like it's going to be an administrative operation when they decide to do the reverse when the primary site is back up. They will indeed likely have playbooks for handling disasters that will involve many administrative operations, with NATS being just a small part of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@derekcollison If you're happy with this consensus then we can merge it for experimenting in preview.2 this week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to add in same subjects from upstream source if we know them. Could be opt-in and fail if we can't reach it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be better to investigate possibilities like that in a separate PR, I suspect we'd have to issue a stream info to find those and a stream update doesn't feel like the right place to do that.
4b40ee6
to
1513c51
Compare
Signed-off-by: Neil Twigg <neil@nats.io>
1513c51
to
4008ce6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This allows promoting mirrors by removing the mirror configuration in a stream update.
Signed-off-by: Neil Twigg neil@nats.io