Don't mark the agent as ready until successfully connecting to the kvstore (if enabled) #36035

giorio94 · 2024-11-19T10:31:15Z

Currently, the kvstore status is considered ready even during the initialization phase, when the connection has not been established yet. This behavior is potentially confusing, as the agents/operator eventually turn ready, breaking out of the startup probe phase, even if an essential subsystem is not ready at all, only to start crashing a few minutes later due to other timeouts kicking in.

Let's modify this behavior so that the kvstore subsystem, and in turn the Cilium agents, are marked ready only after successfully establishing the connection to etcd. Yet, we enable this behavior only if the support for running etcd in pod network is disabled. Differently, we need to preserve the previous behavior, to break the chicken-and-egg dependency during startup.

Even with these changes, the healthiness probe of the Cilium operator is still affected by issues causing it to initially turn ready, even if the connection to the kvstore has not completed yet. The reason being that, currently, only the leader connects to the kvstore [1], which means that the status is reported as healthy until then [2]. This could be fixed by letting all replicas to connect to the kvstore, which should be possible (it historically depended on other subsystems to be started as well, but it is no longer the case since #32916). However, let's leave that for another PR.

[1]: https://github.com/cilium/cilium/blob/mio%2Fkvstore-status/operator/cmd/root.go#L666-L668
[2]: https://github.com/cilium/cilium/blob/mio%2Fkvstore-status/operator/api/health.go#L78-L82

Please review commit by commit, and refer to the individual commit messages for additional details.

Related: #25288

This probe does no longer perform any action at all since [1]. Hence, let's just remove it for the sake of simplicity. [1]: c8ec8c9 ("daemon: Remove option.DaemonConfig.ConfigPatchMutex") Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

daemon/cmd/status.go

Let's drop it in preparation for the subsequent commit which modifies the Status signature. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

Rework the kvstore Status method so that it directly returns a models.Status instance (i.e., combination of state and message), rather than a message and an error. This brings consistency with the other Status methods, makes the actual state explicit (rather than relying on whether the error is set), and prevents confusion due to message and error being the same only in certain cases. This change is intended to not modify the current behavior. Still, it prepares for the subsequent commits, aiming to more easily discern when the etcd connection has not yet been established. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 · 2024-11-19T11:12:45Z

/test

pippolo84

LGTM, just a non-blocking nit left inline. 💯

Regarding #36035 (comment) I agree with your reasoning. Should we maybe just add a release note to inform about the slight change in the reported Status?

pkg/status/status.go

giorio94 · 2024-11-20T16:28:33Z

Regarding #36035 (comment) I agree with your reasoning. Should we maybe just add a release note to inform about the slight change in the reported Status?

Hmm, good point. I can add a brief mention to the upgrade notes.

Currently, the kvstore is marked with status OK (and Disabled message) when it is disabled. Let's modify this to correctly mark it with status Disabled. As a consequence, let's slightly adapt the logic to consider that state as successful as well, so that we preserve the current behavior and not raise an error in that case. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

Currently, the kvstore status is considered ready even during the initialization phase, when the connection has not been established yet. This behavior is potentially confusing, as the agents/operator eventually turn ready, breaking out of the startup probe phase, even if an essential subsystem is not ready at all, only to start crashing a few minutes later due to other timeouts kicking in. Let's modify this behavior so that the kvstore subsystem, and in turn the Cilium agents are marked ready only after successfully establishing the connection to etcd. Yet, we enable this behavior only if the support for running etcd in pod network is disabled. Differently, we need to preserve the previous behavior, to break the chicken-and-egg dependency during startup. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

Ensure that all status probes have completed at least once before starting serving the agent /healthz API. This prevents possible race conditions causing the agent to exit early from the startup probe phase if the healthz endpoint is queried before initializing any of the fields. Overall, the additional wait time is not expected to introduce any concern, considering that practically all probes should return quickly as don't involve expensive operations or API calls. The only exception is the k8s probe, but at that point of the initialization we are already guaranteed to have connected to the k8s API server. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 · 2024-11-20T16:38:38Z

/test

marseel

Thanks, I've added a few comments, I will do one more pass on Monday :)

daemon/cmd/status.go

pkg/status/status.go

status: drop unused check-locks probe

d1420ed

This probe does no longer perform any action at all since [1]. Hence, let's just remove it for the sake of simplicity. [1]: c8ec8c9 ("daemon: Remove option.DaemonConfig.ConfigPatchMutex") Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/kvstore Impacts the KVStore package interactions. area/agent Cilium agent related. labels Nov 19, 2024

giorio94 commented Nov 19, 2024

View reviewed changes

daemon/cmd/status.go Show resolved Hide resolved

giorio94 added 2 commits November 19, 2024 11:54

allocator: drop unused Status method from Backend interface

9b5dd9f

Let's drop it in preparation for the subsequent commit which modifies the Status signature. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 force-pushed the mio/kvstore-status branch 2 times, most recently from c1728b7 to 89c6073 Compare November 19, 2024 11:04

giorio94 marked this pull request as ready for review November 19, 2024 13:55

giorio94 requested review from a team as code owners November 19, 2024 13:55

giorio94 requested review from ldelossa, marseel, joamaki, pippolo84 and a user November 19, 2024 13:55

joamaki approved these changes Nov 19, 2024

View reviewed changes

ghost approved these changes Nov 19, 2024

View reviewed changes

pippolo84 approved these changes Nov 20, 2024

View reviewed changes

pkg/status/status.go Show resolved Hide resolved

giorio94 added 3 commits November 20, 2024 17:37

giorio94 force-pushed the mio/kvstore-status branch from 89c6073 to 585168e Compare November 20, 2024 16:38

marseel reviewed Nov 22, 2024

View reviewed changes

daemon/cmd/status.go Show resolved Hide resolved

daemon/cmd/status.go Show resolved Hide resolved

daemon/cmd/status.go Show resolved Hide resolved

marseel approved these changes Nov 25, 2024

View reviewed changes

pkg/status/status.go Show resolved Hide resolved

ldelossa approved these changes Nov 25, 2024

View reviewed changes

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 25, 2024

julianwiedmann added this pull request to the merge queue Nov 25, 2024

Merged via the queue into cilium:main with commit dbb4bda Nov 25, 2024
64 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Don't mark the agent as ready until successfully connecting to the kvstore (if enabled) #36035

Don't mark the agent as ready until successfully connecting to the kvstore (if enabled) #36035

Uh oh!

giorio94 commented Nov 19, 2024 •

edited

Loading

Uh oh!

Uh oh!

giorio94 commented Nov 19, 2024

Uh oh!

pippolo84 left a comment

Uh oh!

Uh oh!

giorio94 commented Nov 20, 2024

Uh oh!

giorio94 commented Nov 20, 2024

Uh oh!

marseel left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Don't mark the agent as ready until successfully connecting to the kvstore (if enabled) #36035

Don't mark the agent as ready until successfully connecting to the kvstore (if enabled) #36035

Uh oh!

Conversation

giorio94 commented Nov 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

giorio94 commented Nov 19, 2024

Uh oh!

pippolo84 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

giorio94 commented Nov 20, 2024

Uh oh!

giorio94 commented Nov 20, 2024

Uh oh!

marseel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

giorio94 commented Nov 19, 2024 •

edited

Loading