Improve the CiliumNode to KVStore synchronization logic of the Cilium operator #35840

Merged: 3 commits into main from pr/giorio94/main/operator-kvstore-node-sync on Nov 12, 2024

giorio94
Member

@giorio94 giorio94 commented Nov 7, 2024

Extend the operator logic in charge of synchronizing the CiliumNodes into the corresponding KVStore representation to:

  • Skip updating node objects when support for running the KVStore in pod network is disabled, given that it is not required in this case and causes unnecessary churn on both etcd and all watching agents;
  • Make the deletion of node objects more robust by waiting until the agent running on that node stops before removing the entry, to prevent the agent from recreating it right away.

Please review commit by commit, and refer to the individual commit messages for additional details.

@giorio94 giorio94 added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/operator Impacts the cilium-operator component area/kvstore Impacts the KVStore package interactions. labels Nov 7, 2024
@giorio94 giorio94 force-pushed the pr/giorio94/main/operator-kvstore-node-sync branch from 20d8250 to be62e77 Compare November 7, 2024 14:04
@giorio94
Member Author

giorio94 commented Nov 7, 2024

/test

@giorio94
Member Author

giorio94 commented Nov 7, 2024

CI is reasonably green, apart from unrelated flakes. Dropping the test commits.

@giorio94 giorio94 force-pushed the pr/giorio94/main/operator-kvstore-node-sync branch from be62e77 to d8bcfc9 Compare November 7, 2024 17:37
@giorio94
Member Author

giorio94 commented Nov 7, 2024

/test

@giorio94 giorio94 requested a review from marseel November 7, 2024 17:37
@giorio94 giorio94 marked this pull request as ready for review November 7, 2024 17:37
@giorio94 giorio94 requested a review from a team as a code owner November 7, 2024 17:37
@giorio94 giorio94 requested review from pippolo84 and removed request for a team November 7, 2024 17:37
Member

@pippolo84 pippolo84 left a comment

Nice! 💯

Base automatically changed from pr/giorio94/main/disable-crd-kvstore-handover to main November 12, 2024 08:39
@giorio94 giorio94 requested review from a team as code owners November 12, 2024 08:39
Currently, the operator includes logic (enabled by default) to
synchronize CiliumNode objects into the kvstore whenever a change is
detected. Let's always skip it when support for running the kvstore
in pod network is disabled, as it is not required in that case. Indeed,
each agent is always assumed to be able to connect to the kvstore
and keep it up-to-date; otherwise, connectivity to that node is
broken anyway. Yet, the fallback synchronization logic causes
unnecessary churn and load on both etcd and all watching agents
due to the extra events, especially upon operator restart.

Conversely, let's preserve the logic that deletes stale node entries,
as the corresponding Cilium agent cannot take care of it, and we
would otherwise rely only on lease expiration (which might take
a while depending on its duration).

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
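
To illustrate the behavior described in the commit message above, here is a minimal sketch, not the actual operator code. The `kvstoreNodeSync` type, its fields, and the in-memory map standing in for the kvstore are all hypothetical names introduced for illustration. The point is only that upserts become no-ops when pod network support is disabled, while deletions of stale entries are still performed.

```go
package main

// kvstoreNodeSync is a hypothetical, simplified stand-in for the operator's
// CiliumNode-to-kvstore handler. None of these names come from the Cilium
// code base.
type kvstoreNodeSync struct {
	// podNetworkSupport mirrors the "kvstore in pod network" setting.
	podNetworkSupport bool
	// store stands in for the kvstore: node name -> serialized entry.
	store map[string]string
	// writes counts kvstore upserts, to make the avoided churn visible.
	writes int
}

// OnUpsert writes the node entry only when pod network support is enabled.
// Otherwise each agent is assumed to keep its own entry up to date.
func (s *kvstoreNodeSync) OnUpsert(name, entry string) {
	if !s.podNetworkSupport {
		return
	}
	s.store[name] = entry
	s.writes++
}

// OnDelete always removes the entry: the agent on a deleted node cannot
// clean up after itself, and lease expiration alone may take a while.
func (s *kvstoreNodeSync) OnDelete(name string) {
	delete(s.store, name)
}
```

With `podNetworkSupport` set to false, an operator restart that replays every CiliumNode through `OnUpsert` would produce no kvstore writes in this sketch, which is the churn reduction the commit aims for.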
In preparation for the subsequent changes, let's modify the CiliumNode
synchronization handlers so that they can return an error, which
causes the operation to be retried via the work queue. All consumers
are amended to return a nil error, except the kvstore one, which
propagates any error returned by the kvstore.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
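
The handler change described above can be sketched as follows. This is an illustration under stated assumptions: `nodeHandler`, `runHandlers`, `processWithRetry`, and `errKVStoreUnavailable` are hypothetical names, and a plain bounded loop stands in for the work queue, which in practice re-enqueues the key with a rate-limited delay.

```go
package main

import "errors"

// errKVStoreUnavailable is a hypothetical error a kvstore-backed handler
// might return.
var errKVStoreUnavailable = errors.New("kvstore unavailable")

// nodeHandler is a hypothetical handler signature: it returns an error so
// that the caller can decide to retry the operation.
type nodeHandler func(nodeName string) error

// runHandlers invokes every handler for the given node and joins their
// errors. errors.Join returns nil when all handlers succeed.
func runHandlers(nodeName string, handlers []nodeHandler) error {
	var errs []error
	for _, h := range handlers {
		errs = append(errs, h(nodeName))
	}
	return errors.Join(errs...)
}

// processWithRetry stands in for the work queue: it retries the key until
// the handlers succeed or maxAttempts is reached, and reports how many
// attempts were made.
func processWithRetry(key string, handlers []nodeHandler, maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = runHandlers(key, handlers); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}
```

Handlers that never fail simply return nil, so only the kvstore-backed handler can trigger a retry in this sketch.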
The operator already includes reconciliation logic to delete node
entries from the kvstore when the corresponding CiliumNode gets deleted.

However, it is currently affected by a race condition: the
CiliumNode typically gets deleted as a consequence of the Node object
being deleted (as they are bound via an owner reference), although at
that point the Cilium agent on that node may still be running. If
that's the case, the agent detects the removal of the kvstore entry
and reacts by recreating it, defeating the whole purpose of the
operator logic and leaving the node entry to be eventually deleted
by lease expiration only. The occurrence of this race condition
is typically signaled by the "Received delete event for key which
re-appeared within delay time window" log message.

Let's address this by extending the GC logic to defer deleting the
node entry until the Cilium agent running on that node has been
stopped, to make sure it will not be recreated right away. The rate
limiter of the associated work queue is also customized to
increase the base delay (5ms by default), considering that it is
pointless to retry at a high frequency in this case.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
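
The deferred deletion and the larger retry delay described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the `agentRunning` callback (how the operator actually detects a stopped agent is not shown in this conversation), the `gcBaseDelay` and `maxDelay` values, and the simple doubling `backoff` function, which approximates a per-item exponential rate limiter. Only the 5ms default base delay comes from the commit message.

```go
package main

import (
	"errors"
	"time"
)

const (
	// defaultBaseDelay is the default base delay quoted in the commit message.
	defaultBaseDelay = 5 * time.Millisecond
	// gcBaseDelay and maxDelay are hypothetical values chosen for this sketch.
	gcBaseDelay = time.Second
	maxDelay    = 1000 * time.Second
)

// errAgentStillRunning tells the caller to retry the deletion later.
var errAgentStillRunning = errors.New("cilium agent still running on node")

// gcNodeEntry deletes the kvstore entry only once the agent on that node
// has stopped, so that the agent cannot recreate it right away.
// agentRunning is a hypothetical callback.
func gcNodeEntry(store map[string]string, name string, agentRunning func(string) bool) error {
	if agentRunning(name) {
		return errAgentStillRunning
	}
	delete(store, name)
	return nil
}

// backoff returns the delay before the next retry after the given number
// of previous failures: base doubled once per failure, capped at max.
func backoff(base, max time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures; i++ {
		if d >= max/2 {
			return max
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}
```

With a 5ms base delay, the first few retries in this sketch would all happen within a fraction of a second, well before an agent could plausibly have stopped. Raising the base delay spaces the retries out accordingly.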
@giorio94 giorio94 force-pushed the pr/giorio94/main/operator-kvstore-node-sync branch from d8bcfc9 to 99cabf9 Compare November 12, 2024 08:44
@giorio94 giorio94 removed request for a team November 12, 2024 08:45
@giorio94 giorio94 removed request for a team, nathanjsweet and qmonnet November 12, 2024 08:45
@giorio94
Member Author

Rebased onto main, and removed the reviewers triggered due to the automatic switch of the base branch.

Contributor

@marseel marseel left a comment

Thanks, lgtm!

@giorio94
Member Author

/test

@giorio94 giorio94 added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 12, 2024
@squeed squeed added this pull request to the merge queue Nov 12, 2024
Merged via the queue into main with commit e6ff88c Nov 12, 2024
269 checks passed
@squeed squeed deleted the pr/giorio94/main/operator-kvstore-node-sync branch November 12, 2024 14:14
Labels
area/kvstore Impacts the KVStore package interactions. area/operator Impacts the cilium-operator component ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
4 participants