Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Cilium abandons an iteration of identity garbage collection if a CiliumIdentity deletion fails with a conflict. In a cluster with a large number of CiliumIdentity objects and a high rate of identity creation/deletion, this can leave cilium-operator unable to delete stale objects. If too many objects accumulate, new pods can fail to establish network connectivity.
The initial indication of an issue was the following events on k8s pods that were failing to start in the expected amount of time:
1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e5674e9fe71dabb2b8957fab2246bb238f8e0d09698ea42ba27e3b2a66949c7d": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "399a8a1213a3127502dafe0d70dbc466d117fbf8242765100cfbd636c2ad2a98": plugin type="cilium-cni" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
This led to the discovery that the affected cluster had close to 65k CiliumIdentity objects, many of them stale (both with and without the io.cilium.heartbeat annotation). An examination of the relevant code and logs indicated that the gc procedure iterates over all CiliumIdentity objects on each run, but abandons the run if a delete attempt fails because of a conflicting update to the CiliumIdentity being deleted. If this happens early in the iteration, and happens often, cilium-operator may be unable to gc stale objects before the excess object count causes operational problems for Cilium.
I am not sure what is introducing the conflicts. I examined our apiserver audit logs, but they lack sufficient detail to show what is modifying the CiliumIdentity objects between the time they are first annotated with io.cilium.heartbeat and the time the gc loop attempts deletion, and we are unable to reconfigure the audit logging settings due to restrictions imposed by our cloud provider. Regardless of whether the conflicts originate in Cilium, Cilium should handle them gracefully and not abandon an entire gc iteration when a single delete is conflicted.
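The graceful handling described above amounts to skipping a conflicted delete and moving on to the next object, leaving the skipped identity for a later gc run. Here is a minimal sketch of that pattern; the names gcIdentities, deleteIdentity, and errConflict are hypothetical stand-ins (real code would detect a 409 with apierrors.IsConflict from k8s.io/apimachinery and call the apiserver), not the actual Cilium implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict simulates a Kubernetes 409 Conflict error.
var errConflict = errors.New("conflict")

// deleteIdentity is a hypothetical stand-in for the apiserver delete call;
// one identity fails with a conflict to simulate a concurrent update.
func deleteIdentity(name string) error {
	if name == "identity-2" {
		return errConflict
	}
	return nil
}

// gcIdentities deletes each identity. On a conflict it logs, skips the
// object, and continues iterating, instead of abandoning the whole run.
func gcIdentities(ids []string, del func(string) error) int {
	deleted := 0
	for _, id := range ids {
		if err := del(id); err != nil {
			if errors.Is(err, errConflict) {
				fmt.Printf("skipping %s: %v\n", id, err)
				continue // leave it for the next gc run
			}
			// Non-conflict errors still end the run.
			return deleted
		}
		deleted++
	}
	return deleted
}

func main() {
	n := gcIdentities([]string{"identity-1", "identity-2", "identity-3"}, deleteIdentity)
	fmt.Printf("deleted %d of 3 identities\n", n)
	// prints: skipping identity-2: conflict
	//         deleted 2 of 3 identities
}
```

With this shape, a single conflicted object costs one skipped delete per run rather than aborting the remaining deletions, so stale identities cannot accumulate behind it.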
Pull request incoming.
Cilium Version
1.15.3
Kernel Version
5.10.215-203.850.amzn2.x86_64
Kubernetes Version
1.28
Regression
No response
Sysdump
No response
Relevant log output
1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e5674e9fe71dabb2b8957fab2246bb238f8e0d09698ea42ba27e3b2a66949c7d": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "399a8a1213a3127502dafe0d70dbc466d117fbf8242765100cfbd636c2ad2a98": plugin type="cilium-cni" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
Anything else?
No response
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct