Improve the CiliumNode to KVStore synchronization logic of the Cilium operator #35840
Conversation
Force-pushed from 20d8250 to be62e77.
/test

CI is reasonably green, besides unrelated flakes. Dropping the test commits.
Force-pushed from be62e77 to d8bcfc9.
/test
Nice! 💯
Currently, the operator includes logic (enabled by default) to synchronize CiliumNode objects into the kvstore whenever a change is detected. Let's always skip it when support for running the kvstore in pod network is disabled, as it is not required in that case: each agent is assumed to be able to connect to the kvstore and keep it up-to-date, otherwise connectivity to that node is broken anyway. Meanwhile, the fallback synchronization logic causes unnecessary churn and load on both etcd and all watching agents due to the extra events, especially upon operator restart.

Conversely, let's preserve the logic that deletes stale node entries, as the corresponding Cilium agent cannot take care of that, and we would otherwise rely only on lease expiration (which might take a while, depending on its duration).

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
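For illustration only, a minimal Go sketch of the gating described above; the names (`Config`, `KVStorePodNetworkSupport`, `syncToKVStore`, `handleCiliumNodeUpsert`) are hypothetical and do not correspond to the actual Cilium code:

```go
package main

import "log"

// Config models the relevant operator setting; the field name is
// hypothetical and does not match the real Cilium configuration.
type Config struct {
	// KVStorePodNetworkSupport indicates whether the kvstore may run in
	// pod network, in which case the operator must keep node entries in
	// sync because agents may temporarily lose kvstore connectivity.
	KVStorePodNetworkSupport bool
}

// CiliumNode is a stand-in for the real CRD type.
type CiliumNode struct{ Name string }

// syncToKVStore is a placeholder for the actual upsert logic.
func syncToKVStore(node *CiliumNode) error {
	log.Printf("upserting node %s into the kvstore", node.Name)
	return nil
}

// handleCiliumNodeUpsert skips the fallback synchronization entirely when
// kvstore-in-pod-network support is disabled: each agent keeps its own
// entry up-to-date, so the operator only needs to handle deletions.
func handleCiliumNodeUpsert(cfg Config, node *CiliumNode) error {
	if !cfg.KVStorePodNetworkSupport {
		// Nothing to do: the agent owns its kvstore entry.
		return nil
	}
	return syncToKVStore(node)
}

func main() {
	cfg := Config{KVStorePodNetworkSupport: false}
	_ = handleCiliumNodeUpsert(cfg, &CiliumNode{Name: "node-1"})
}
```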
In preparation for the subsequent changes, let's modify the CiliumNode synchronization handlers so that they can return an error, eventually causing the operation to be retried via the workqueue. All consumers are amended to return a nil error, except for the kvstore one, which propagates the possible error returned by the kvstore.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
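As a rough sketch of the retry pattern this enables, using the standard client-go workqueue (the handler signature and names below are illustrative, not the operator's actual ones):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// nodeHandler is a simplified CiliumNode handler that can now return an
// error; the signature is illustrative only.
type nodeHandler func(nodeName string) error

// processNextItem pops one key from the queue, invokes the handler, and
// re-queues the key with backoff when the handler fails.
func processNextItem(queue workqueue.RateLimitingInterface, handle nodeHandler) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	if err := handle(key.(string)); err != nil {
		// Retry later; the rate limiter decides the backoff delay.
		queue.AddRateLimited(key)
		return true
	}
	// Success: reset the failure history for this key.
	queue.Forget(key)
	return true
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	queue.Add("node-1")
	processNextItem(queue, func(name string) error {
		fmt.Println("syncing", name, "into the kvstore")
		return nil
	})
}
```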
The operator already includes reconciliation logic to delete node entries from the kvstore when the corresponding CiliumNode gets deleted. However, it is currently affected by a race condition: the CiliumNode is typically deleted as a consequence of the Node object being deleted (as they are bound via an owner reference), while at that point the Cilium agent on that node may still be running. If that's the case, the agent detects the removal of the kvstore entry and reacts by recreating it, defeating the whole purpose of the operator logic and leaving the node entry to be deleted only upon lease expiration. The occurrence of this race condition is typically signaled by the "Received delete event for key which re-appeared within delay time window" log message.

Let's address this by extending the GC logic to wait until the Cilium agent running on that node has been stopped before deleting the node entry, to make sure it will not be recreated right away. The rate limiter of the associated workqueue is also customized to increase the base delay (5ms by default), considering that it is pointless to retry with high frequency in this case.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
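A hedged Go sketch of the two ideas above, namely delaying deletion until the agent is gone and increasing the base delay of the exponential rate limiter; the helper callbacks and the chosen delays are hypothetical, not the operator's actual values:

```go
package main

import (
	"errors"
	"log"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// errAgentStillRunning signals that the deletion must be retried later.
var errAgentStillRunning = errors.New("cilium agent still running on node")

// newGCQueue builds a workqueue whose exponential-backoff rate limiter uses
// a larger base delay than the 5ms default, since retrying the deletion at
// high frequency is pointless while waiting for the agent to terminate.
func newGCQueue() workqueue.RateLimitingInterface {
	limiter := workqueue.NewItemExponentialFailureRateLimiter(500*time.Millisecond, 2*time.Minute)
	return workqueue.NewRateLimitingQueue(limiter)
}

// gcNodeEntry deletes the kvstore entry only once the Cilium agent on the
// node is gone; otherwise it returns an error so that the queue retries
// with backoff. Both callbacks are hypothetical helpers.
func gcNodeEntry(nodeName string, agentRunning func(string) bool, deleteEntry func(string) error) error {
	if agentRunning(nodeName) {
		// The agent would immediately recreate the entry: retry later.
		return errAgentStillRunning
	}
	return deleteEntry(nodeName)
}

func main() {
	queue := newGCQueue()
	queue.Add("node-1")
	err := gcNodeEntry("node-1",
		func(string) bool { return false }, // pretend the agent is already gone
		func(name string) error {
			log.Printf("deleting kvstore entry for %s", name)
			return nil
		},
	)
	if err != nil {
		queue.AddRateLimited("node-1")
	}
}
```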
Force-pushed from d8bcfc9 to 99cabf9.
Rebased onto main, and removed the reviewers that were requested due to the automatic switch of the base branch.
Thanks, lgtm!
/test
Extend the operator logic in charge of synchronizing the CiliumNodes into the corresponding KVStore representation to:
- skip the fallback synchronization when support for running the kvstore in pod network is disabled, as each agent keeps its own entry up-to-date in that case;
- let the synchronization handlers return an error, so that failed operations are retried via the workqueue;
- avoid the race condition in the garbage collection of stale node entries, by waiting for the Cilium agent on the given node to be stopped before deleting the entry.
Please review commit by commit, and refer to the individual commit messages for additional details.