Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Cilium abandons an iteration of identity garbage collection if a CiliumIdentity deletion fails with a conflict. In a cluster with a large number of CiliumIdentity objects and a high rate of identity creation/deletion, this can leave cilium-operator unable to delete stale objects. If too many objects accumulate, new pods can fail to establish network connectivity.
The initial indication of an issue was the following events on k8s pods that were failing to start in the expected amount of time:
1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e5674e9fe71dabb2b8957fab2246bb238f8e0d09698ea42ba27e3b2a66949c7d": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "399a8a1213a3127502dafe0d70dbc466d117fbf8242765100cfbd636c2ad2a98": plugin type="cilium-cni" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
This led to the discovery that the affected cluster had close to 65k CiliumIdentity objects, many of them stale (both with and without the io.cilium.heartbeat annotation). An examination of the relevant code and logs indicated that the gc procedure iterates over all CiliumIdentity objects on each run, but abandons the run if a delete attempt fails because of a conflicting update to the CiliumIdentity being deleted. If this happens early in the iteration, and happens often, cilium-operator may be unable to gc stale objects before the excess object count causes operational problems for Cilium.
I am not sure what is introducing the conflicts. I examined our apiserver audit logs, but they lack sufficient detail to show what is modifying the CiliumIdentity objects between the time they are first annotated with io.cilium.heartbeat and the time the gc loop attempts deletion, and we are unable to reconfigure the audit logging settings due to restrictions imposed by our cloud provider. Regardless of whether the conflicts originate in Cilium, Cilium should handle them gracefully and not abandon an entire gc iteration when a single delete is conflicted.
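The graceful handling described above amounts to skipping a conflicted delete and moving on to the next object, leaving the skipped identity for a later gc run. Here is a minimal sketch of that pattern; the names gcIdentities, deleteIdentity, and errConflict are hypothetical stand-ins (real code would detect a 409 with apierrors.IsConflict from k8s.io/apimachinery and call the apiserver), not the actual Cilium implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict simulates a Kubernetes 409 Conflict error.
var errConflict = errors.New("conflict")

// deleteIdentity is a hypothetical stand-in for the apiserver delete call;
// one identity fails with a conflict to simulate a concurrent update.
func deleteIdentity(name string) error {
	if name == "identity-2" {
		return errConflict
	}
	return nil
}

// gcIdentities deletes each identity. On a conflict it logs, skips the
// object, and continues iterating, instead of abandoning the whole run.
func gcIdentities(ids []string, del func(string) error) int {
	deleted := 0
	for _, id := range ids {
		if err := del(id); err != nil {
			if errors.Is(err, errConflict) {
				fmt.Printf("skipping %s: %v\n", id, err)
				continue // leave it for the next gc run
			}
			// Non-conflict errors still end the run.
			return deleted
		}
		deleted++
	}
	return deleted
}

func main() {
	n := gcIdentities([]string{"identity-1", "identity-2", "identity-3"}, deleteIdentity)
	fmt.Printf("deleted %d of 3 identities\n", n)
	// prints: skipping identity-2: conflict
	//         deleted 2 of 3 identities
}
```

With this shape, a single conflicted object costs one skipped delete per run rather than aborting the remaining deletions, so stale identities cannot accumulate behind it.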
Pull request incoming.
Cilium Version
1.15.3
Kernel Version
5.10.215-203.850.amzn2.x86_64
Kubernetes Version
1.28
Regression
No response
Sysdump
No response
Relevant log output
1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e5674e9fe71dabb2b8957fab2246bb238f8e0d09698ea42ba27e3b2a66949c7d": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
1 FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "399a8a1213a3127502dafe0d70dbc466d117fbf8242765100cfbd636c2ad2a98": plugin type="cilium-cni" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
Anything else?
No response
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct