boot: Drop ostree-finalize-staged.path #3389
Conversation
Skipping CI for Draft Pull Request.
Haven't really tested this yet. We should reproduce the race in that bug and sanity-check it fixes it before we do this.
Of course, there's an argument to be made here that rebooting right after staging is silly, since you could just as well have not used staging. But we also shouldn't be prone to races in the default deployment path.
This effectively reverts ac1a919 ("boot: Add ostree-finalize-staged.path"). A bug came in on the OCP side that demonstrates that the way things are set up right now is racy. If a reboot is triggered quickly after staging a deployment, the whole pipeline of:
- ostree-finalize-staged.path, which triggers
- ostree-finalize-staged.service, which triggers
- ostree-finalize-staged-hold.service

may not fully have happened before systemd isolates to `reboot.target`, which will want to kill all pending jobs. Just directly starting the systemd unit is less elegant, but it is much more explicit and gets rid of any possible race because it's directly part of the staging operation.

Fixes: https://issues.redhat.com/browse/OCPBUGS-51150
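For context, the dropped path unit is a tiny systemd unit that watches for the staged-deployment marker under /run and, when it appears, starts the service of the same name. A simplified approximation (not the verbatim file from the ostree tree; the watched path is an assumption here):

```ini
# Approximation of ostree-finalize-staged.path (illustrative, not verbatim).
[Unit]
Description=Monitor for an OSTree staged deployment

[Path]
# Assumed trigger: when this marker file appears, systemd queues a start job
# for ostree-finalize-staged.service (the unit with the matching name).
PathExists=/run/ostree/staged-deployment
```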
Force-pushed from 1f10145 to 2b9912e.
I think it's less risky if we only have the change in sysroot-deploy to synchronously start `-finalize-staged.service`. It shouldn't hurt to still have the `.path` unit attempting to also queue the start, right?
But, I dunno, I'm fine with this too; it is perhaps clearer to only have one way it's done.
I wouldn't want to recommend that, because it re-introduces the possibility of races if there's anything else running at the time that wants to serialize state to …
IOW yes this - I think we should always use staging, and we should definitely fix this race. (That said, there is a longer-term thing I think we talked about at some point: moving the finalization into the initramfs, i.e. pivoting back to the initramfs at shutdown, which would also have avoided any race like this.)
I have nothing against this change, but how does it fix the race? Whether you trigger via a path unit systemd is monitoring or you start the service directly, I'd think that the only way to truly fix this issue would be to inhibit shutdown during the staging process.
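For what it's worth, the "inhibit shutdown" idea would look roughly like taking a logind delay inhibitor for the duration of staging/finalization. A minimal sketch, assuming GDBus and a hypothetical helper name (this is not part of this PR):

```c
#include <gio/gio.h>
#include <gio/gunixfdlist.h>

/* Hypothetical sketch: take a logind "delay" inhibitor so that a reboot
 * requested while we are still staging/finalizing is held back briefly.
 * The returned fd keeps the inhibitor alive; close it to release it. */
static int
acquire_shutdown_inhibitor (GError **error)
{
  g_autoptr(GDBusConnection) bus = g_bus_get_sync (G_BUS_TYPE_SYSTEM, NULL, error);
  if (bus == NULL)
    return -1;

  g_autoptr(GUnixFDList) fds = NULL;
  g_autoptr(GVariant) reply =
      g_dbus_connection_call_with_unix_fd_list_sync (
          bus, "org.freedesktop.login1", "/org/freedesktop/login1",
          "org.freedesktop.login1.Manager", "Inhibit",
          g_variant_new ("(ssss)", "shutdown", "ostree",
                         "Finalizing staged deployment", "delay"),
          G_VARIANT_TYPE ("(h)"), G_DBUS_CALL_FLAGS_NONE, -1,
          NULL, &fds, NULL, error);
  if (reply == NULL)
    return -1;

  gint32 handle;
  g_variant_get (reply, "(h)", &handle);
  return g_unix_fd_list_get (fds, handle, error);
}
```

Note that a delay-mode inhibitor only holds shutdown for logind's InhibitDelayMaxSec, so it would soften rather than eliminate the window; a synchronous start during staging (as this PR does) avoids relying on that.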
We now invoke the service start directly during staging. (The more important fix in here, though, isn't really that we do it before, but just that we do it at all synchronously as part of the staging API call. That forces the client calling into it not to move on to initiating a reboot -- since it's likely the same software -- until we're guaranteed the service is active, instead of relying on the path unit and its dependency tree to have kicked in.)
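As an illustration of what a synchronous start during staging looks like, here is a minimal GDBus sketch; this is not the actual ostree code, and the real implementation may additionally wait for the unit to report active:

```c
#include <gio/gio.h>

/* Illustrative sketch: synchronously ask systemd (PID 1) to start the
 * finalization service over D-Bus.  The call returns once systemd has
 * accepted and enqueued the start job, so the caller can't race ahead to
 * a reboot before the unit is at least being started. */
static gboolean
start_finalize_staged_service (GError **error)
{
  g_autoptr(GDBusConnection) bus = g_bus_get_sync (G_BUS_TYPE_SYSTEM, NULL, error);
  if (bus == NULL)
    return FALSE;

  g_autoptr(GVariant) reply =
      g_dbus_connection_call_sync (
          bus, "org.freedesktop.systemd1", "/org/freedesktop/systemd1",
          "org.freedesktop.systemd1.Manager", "StartUnit",
          g_variant_new ("(ss)", "ostree-finalize-staged.service", "replace"),
          G_VARIANT_TYPE ("(o)"),  /* returns the job object path */
          G_DBUS_CALL_FLAGS_NONE, -1, NULL, error);
  return reply != NULL;
}
```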
Could go either way on this. I guess if we were to backport, I would agree it'd be less risky to just backport the …
OK, this is trivial to reproduce actually by just adding a drop-in for …
And indeed, this patch fixes the issue.
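A sketch of that kind of reproduction, assuming the drop-in targets ostree-finalize-staged.service and simply delays its startup so the staging-then-reboot window is easy to hit (the exact drop-in used in the comment above isn't preserved in this thread):

```ini
# /etc/systemd/system/ostree-finalize-staged.service.d/50-delay.conf
# Hypothetical repro aid: delay the service start so that a reboot issued
# right after staging lands before the finalization pipeline has run.
[Service]
ExecStartPre=/usr/bin/sleep 10
```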