Skip to content

Don't mark the agent as ready until successfully connecting to the kvstore (if enabled) #36035

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Nov 25, 2024

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Nov 19, 2024

Currently, the kvstore status is considered ready even during the initialization phase, when the connection has not been established yet. This behavior is potentially confusing, as the agents/operator eventually turn ready, breaking out of the startup probe phase, even if an essential subsystem is not ready at all, only to start crashing a few minutes later due to other timeouts kicking in.

Let's modify this behavior so that the kvstore subsystem, and in turn the Cilium agents, are marked ready only after successfully establishing the connection to etcd. Yet, we enable this behavior only if the support for running etcd in pod network is disabled. Differently, we need to preserve the previous behavior, to break the chicken-and-egg dependency during startup.

Even with these changes, the healthiness probe of the Cilium operator is still affected by issues causing it to initially turn ready, even if the connection to the kvstore has not completed yet. The reason being that, currently, only the leader connects to the kvstore [1], which means that the status is reported as healthy until then [2]. This could be fixed by letting all replicas to connect to the kvstore, which should be possible (it historically depended on other subsystems to be started as well, but it is no longer the case since #32916). However, let's leave that for another PR.

[1]: https://github.com/cilium/cilium/blob/mio%2Fkvstore-status/operator/cmd/root.go#L666-L668
[2]: https://github.com/cilium/cilium/blob/mio%2Fkvstore-status/operator/api/health.go#L78-L82

Please review commit by commit, and refer to the individual commit messages for additional details.

Related: #25288

This probe does no longer perform any action at all since [1].
Hence, let's just remove it for the sake of simplicity.

[1]: c8ec8c9 ("daemon: Remove option.DaemonConfig.ConfigPatchMutex")

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/kvstore Impacts the KVStore package interactions. area/agent Cilium agent related. labels Nov 19, 2024
Let's drop it in preparation for the subsequent commit which modifies
the Status signature.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Rework the kvstore Status method so that it directly returns a
models.Status instance (i.e., combination of state and message),
rather than a message and an error. This brings consistency with
the other Status methods, makes the actual state explicit (rather
than relying on whether the error is set), and prevents confusion
due to message and error being the same only in certain cases.

This change is intended to not modify the current behavior. Still,
it prepares for the subsequent commits, aiming to more easily
discern when the etcd connection has not yet been established.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/kvstore-status branch 2 times, most recently from c1728b7 to 89c6073 Compare November 19, 2024 11:04
@giorio94
Copy link
Member Author

/test

@giorio94 giorio94 marked this pull request as ready for review November 19, 2024 13:55
@giorio94 giorio94 requested review from a team as code owners November 19, 2024 13:55
@giorio94 giorio94 requested review from ldelossa, marseel, joamaki, pippolo84 and a user November 19, 2024 13:55
Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a non-blocking nit left inline. 💯

Regarding #36035 (comment) I agree with your reasoning. Should we maybe just add a release note to inform about the slight change in the reported Status?

@giorio94
Copy link
Member Author

Regarding #36035 (comment) I agree with your reasoning. Should we maybe just add a release note to inform about the slight change in the reported Status?

Hmm, good point. I can add a brief mention to the upgrade notes.

Currently, the kvstore is marked with status OK (and Disabled message)
when it is disabled. Let's modify this to correctly mark it with status
Disabled. As a consequence, let's slightly adapt the logic to consider
that state as successful as well, so that we preserve the current
behavior and not raise an error in that case.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Currently, the kvstore status is considered ready even during the
initialization phase, when the connection has not been established
yet. This behavior is potentially confusing, as the agents/operator
eventually turn ready, breaking out of the startup probe phase, even
if an essential subsystem is not ready at all, only to start crashing
a few minutes later due to other timeouts kicking in.

Let's modify this behavior so that the kvstore subsystem, and in turn
the Cilium agents are marked ready only after successfully establishing
the connection to etcd. Yet, we enable this behavior only if the support
for running etcd in pod network is disabled. Differently, we need to
preserve the previous behavior, to break the chicken-and-egg dependency
during startup.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Ensure that all status probes have completed at least once before
starting serving the agent /healthz API. This prevents possible
race conditions causing the agent to exit early from the startup
probe phase if the healthz endpoint is queried before initializing
any of the fields. Overall, the additional wait time is not expected
to introduce any concern, considering that practically all probes
should return quickly as don't involve expensive operations or API
calls. The only exception is the k8s probe, but at that point of
the initialization we are already guaranteed to have connected
to the k8s API server.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94
Copy link
Member Author

/test

Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've added a few comments, I will do one more pass on Monday :)

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 25, 2024
@julianwiedmann julianwiedmann added this pull request to the merge queue Nov 25, 2024
Merged via the queue into cilium:main with commit dbb4bda Nov 25, 2024
64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/agent Cilium agent related. area/kvstore Impacts the KVStore package interactions. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants