Don't mark the agent as ready until successfully connecting to the kvstore (if enabled) #36035
Conversation
This probe no longer performs any action at all since [1]. Hence, let's just remove it for the sake of simplicity. [1]: c8ec8c9 ("daemon: Remove option.DaemonConfig.ConfigPatchMutex") Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Let's drop it in preparation for the subsequent commit which modifies the Status signature. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Rework the kvstore Status method so that it directly returns a models.Status instance (i.e., a combination of state and message), rather than a message and an error. This brings consistency with the other Status methods, makes the actual state explicit (rather than relying on whether the error is set), and prevents confusion arising from the message and the error being identical only in certain cases. This change is not intended to modify the current behavior; still, it prepares for the subsequent commits, which aim to more easily discern when the etcd connection has not yet been established. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
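For illustration, a minimal Go sketch of the shape of this change, using a hypothetical Status struct in place of the generated models.Status and made-up state strings:

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for models.Status: an explicit
// state plus a human-readable message, replacing the previous
// (message, error) return pair.
type Status struct {
	State string // e.g. "Ok", "Warning", "Failure", "Disabled"
	Msg   string
}

// kvstoreStatus sketches how an explicit state removes the need to
// infer readiness from whether the returned error is nil.
func kvstoreStatus(connected bool, err error) Status {
	switch {
	case err != nil:
		return Status{State: "Failure", Msg: err.Error()}
	case !connected:
		return Status{State: "Warning", Msg: "waiting for the initial etcd connection"}
	default:
		return Status{State: "Ok", Msg: "connected to etcd"}
	}
}

func main() {
	fmt.Println(kvstoreStatus(false, nil))
	fmt.Println(kvstoreStatus(true, nil))
	fmt.Println(kvstoreStatus(false, errors.New("dial timeout")))
}
```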
Force-pushed from c1728b7 to 89c6073
/test
LGTM, just a non-blocking nit left inline. 💯
Regarding #36035 (comment), I agree with your reasoning. Should we maybe just add a release note to inform about the slight change in the reported Status?
Hmm, good point. I can add a brief mention to the upgrade notes.
Currently, the kvstore is marked with status OK (and a Disabled message) when it is disabled. Let's modify this to correctly mark it with status Disabled. As a consequence, let's slightly adapt the logic to consider that state as successful as well, so that we preserve the current behavior and do not raise an error in that case. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
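A minimal sketch of the adapted check, assuming a hypothetical helper and plain state strings rather than the generated constants:

```go
package main

import "fmt"

// stateOK sketches the adapted success check: the new explicit
// "Disabled" state is treated as healthy alongside "Ok", so a disabled
// kvstore does not surface an error. Hypothetical helper with plain
// strings instead of the generated models constants.
func stateOK(state string) bool {
	return state == "Ok" || state == "Disabled"
}

func main() {
	fmt.Println(stateOK("Disabled")) // true: behavior preserved for a disabled kvstore
	fmt.Println(stateOK("Failure"))  // false
}
```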
Currently, the kvstore status is considered ready even during the initialization phase, when the connection has not been established yet. This behavior is potentially confusing, as the agents/operator eventually turn ready, breaking out of the startup probe phase, even if an essential subsystem is not ready at all, only to start crashing a few minutes later due to other timeouts kicking in. Let's modify this behavior so that the kvstore subsystem, and in turn the Cilium agents, are marked ready only after successfully establishing the connection to etcd. Yet, we enable this behavior only if the support for running etcd in pod network is disabled. Otherwise, we need to preserve the previous behavior, to break the chicken-and-egg dependency during startup. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
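A minimal sketch of the resulting readiness rule, with hypothetical names for the connection flag and the etcd-in-pod-network option:

```go
package main

import "fmt"

// kvstoreReady sketches the new readiness rule: when etcd runs outside
// pod network, the subsystem reports ready only once the initial
// connection has been established; when running etcd in pod network is
// supported, the previous optimistic behavior is kept to avoid the
// chicken-and-egg dependency at startup. Names are hypothetical.
func kvstoreReady(connected, etcdInPodNetworkSupport bool) bool {
	if etcdInPodNetworkSupport {
		// The agent must turn ready first, so that the etcd pods can
		// themselves obtain connectivity.
		return true
	}
	return connected
}

func main() {
	fmt.Println(kvstoreReady(false, false)) // false: wait for the connection
	fmt.Println(kvstoreReady(false, true))  // true: preserve previous behavior
}
```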
Ensure that all status probes have completed at least once before starting to serve the agent /healthz API. This prevents possible race conditions causing the agent to exit early from the startup probe phase if the healthz endpoint is queried before initializing any of the fields. Overall, the additional wait time is not expected to introduce any concern, considering that practically all probes should return quickly, as they don't involve expensive operations or API calls. The only exception is the k8s probe, but at that point of the initialization we are already guaranteed to have connected to the k8s API server. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
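A minimal sketch of the waiting pattern, with made-up probe functions standing in for the agent's actual status probes:

```go
package main

import (
	"fmt"
	"sync"
)

// waitForFirstRun sketches the idea: run every status probe once and
// only then start serving /healthz, so a query can never observe
// uninitialized status fields. Probe shapes are made up for
// illustration.
func waitForFirstRun(probes []func()) {
	var wg sync.WaitGroup
	for _, probe := range probes {
		wg.Add(1)
		go func(p func()) {
			defer wg.Done()
			p() // the first collection populates the corresponding status field
		}(probe)
	}
	wg.Wait() // block until every probe has completed at least once
}

func main() {
	probes := []func(){
		func() { fmt.Println("kvstore probe done") },
		func() { fmt.Println("k8s probe done") },
	}
	waitForFirstRun(probes)
	fmt.Println("now serving /healthz")
}
```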
Force-pushed from 89c6073 to 585168e
/test
Thanks, I've added a few comments; I will do one more pass on Monday :)
Currently, the kvstore status is considered ready even during the initialization phase, when the connection has not been established yet. This behavior is potentially confusing, as the agents/operator eventually turn ready, breaking out of the startup probe phase, even if an essential subsystem is not ready at all, only to start crashing a few minutes later due to other timeouts kicking in.
Let's modify this behavior so that the kvstore subsystem, and in turn the Cilium agents, are marked ready only after successfully establishing the connection to etcd. Yet, we enable this behavior only if the support for running etcd in pod network is disabled. Otherwise, we need to preserve the previous behavior, to break the chicken-and-egg dependency during startup.
Even with these changes, the healthiness probe of the Cilium operator is still affected by issues causing it to initially turn ready even if the connection to the kvstore has not been completed yet. The reason is that, currently, only the leader connects to the kvstore [1], which means that the status is reported as healthy until then [2] (see the sketch after the links below). This could be fixed by letting all replicas connect to the kvstore, which should be possible (it historically depended on other subsystems being started as well, but that is no longer the case since #32916). However, let's leave that for another PR.
[1]: https://github.com/cilium/cilium/blob/mio%2Fkvstore-status/operator/cmd/root.go#L666-L668
[2]: https://github.com/cilium/cilium/blob/mio%2Fkvstore-status/operator/api/health.go#L78-L82
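A minimal sketch of why that happens, with hypothetical names; the actual operator logic lives behind the links above:

```go
package main

import "fmt"

// operatorHealthy sketches why the operator health probe can still turn
// ready too early: non-leader replicas never create a kvstore client,
// and a handler that treats "no client yet" as healthy reports OK until
// leader election completes and the connection attempt actually starts.
// Simplified illustration with hypothetical names, not the actual
// operator/api/health.go logic.
func operatorHealthy(isLeader, kvstoreConnected bool) bool {
	if !isLeader {
		// No kvstore client exists yet, so there is nothing to check.
		return true
	}
	return kvstoreConnected
}

func main() {
	fmt.Println(operatorHealthy(false, false)) // true before winning the election
	fmt.Println(operatorHealthy(true, false))  // false until connected
}
```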
Please review commit by commit, and refer to the individual commit messages for additional details.
Related: #25288