Cilium abandons identity garbage collection if a CiliumIdentity deletion is conflicted #33142

@JacobHenner

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Cilium abandons an iteration of identity garbage collection if a CiliumIdentity deletion is conflicted. In a cluster with a large number of CiliumIdentity objects and a high rate of identity creation/deletion, this can leave cilium-operator unable to delete stale objects. If too many objects accumulate, new pods can face issues establishing network connectivity.

The initial indication of an issue was the following events on k8s pods that were failing to start in the expected amount of time:

1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e5674e9fe71dabb2b8957fab2246bb238f8e0d09698ea42ba27e3b2a66949c7d": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "399a8a1213a3127502dafe0d70dbc466d117fbf8242765100cfbd636c2ad2a98": plugin type="cilium-cni" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests

This led to the discovery that the affected cluster had close to 65k CiliumIdentity objects, many of them stale, both with and without the io.cilium.heartbeat annotation. An examination of the relevant code and logs indicated that the gc procedure iterates over all CiliumIdentity objects on each run, but abandons the run if a delete attempt fails because of a conflicting update to the CiliumIdentity being deleted. If this happens early in the iteration through the list of objects, and happens often, cilium-operator may be unable to gc stale objects before the excess object count causes operational problems for Cilium.

I am not sure what is introducing the conflicts. I examined our apiserver audit logs, but they lack the detail needed to show what is modifying the CiliumIdentity objects between the time they are first annotated with io.cilium.heartbeat and the time the gc loop attempts deletion. We are unable to reconfigure the audit logging settings due to restrictions imposed by our cloud provider. Regardless of whether the conflict is introduced by Cilium, Cilium should gracefully handle conflicts and not abandon the entire gc iteration when a single delete is conflicted.

Pull request incoming.

Cilium Version

1.15.3

Kernel Version

5.10.215-203.850.amzn2.x86_64

Kubernetes Version

1.28

Regression

No response

Sysdump

No response

Relevant log output

1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e5674e9fe71dabb2b8957fab2246bb238f8e0d09698ea42ba27e3b2a66949c7d": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded

1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "399a8a1213a3127502dafe0d70dbc466d117fbf8242765100cfbd636c2ad2a98": plugin type="cilium-cni" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees

No one assigned

    Labels

      • area/agent: Cilium agent related.
      • kind/bug: This is a bug in the Cilium logic.
      • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
      • needs/triage: This issue requires triaging to establish severity and next steps.
      • sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.
