AWS ENI IP addresses can get stuck in released status, preventing allocation to pods #39981

@victorcq

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.17.4 and lower than v1.18.0

What happened?

We've identified a bug in Cilium's IPAM with AWS ENI where an IP address that was marked for release and then reassigned back to the ENI can become stuck in a state where it's both in the CiliumNode's spec.ipam.pool and in status.ipam.release-ips with status "released". This prevents the IP from being assigned to new pods, leading to pods stuck in ContainerCreating state with "No more IPs available" errors.
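The stuck state can be sketched with a minimal model. This is a hypothetical simplification, not Cilium's actual types; it only mirrors the behavior described above, where any IP still tracked in status.ipam.release-ips is treated as in-handshake and excluded from allocation:

```go
package main

import "fmt"

// node mimics the two relevant CiliumNode fields: the allocation pool
// (spec.ipam.pool) and the release handshake map (status.ipam.release-ips).
type node struct {
	pool       map[string]struct{} // spec.ipam.pool
	releaseIPs map[string]string   // status.ipam.release-ips: ip -> status
}

// inReleaseHandshake models the isIPInReleaseHandshake check: any IP still
// tracked in release-ips is considered in-handshake.
func (n *node) inReleaseHandshake(ip string) bool {
	_, ok := n.releaseIPs[ip]
	return ok
}

// allocatable returns the pool IPs that can actually be handed to pods.
func (n *node) allocatable() []string {
	var out []string
	for ip := range n.pool {
		if !n.inReleaseHandshake(ip) {
			out = append(out, ip)
		}
	}
	return out
}

func main() {
	// The stuck state: the same IP is in the pool AND marked "released".
	n := &node{
		pool:       map[string]struct{}{"10.0.0.5": {}},
		releaseIPs: map[string]string{"10.0.0.5": "released"},
	}
	fmt.Println(len(n.allocatable())) // 0: the IP exists but is unavailable
}
```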

Environment

Cilium version: 1.17.2
Kubernetes: EKS 1.31
IPAM mode: AWS ENI

Here is where I believe the issue occurs: in the IP release handshake, when the following sequence happens:

  1. An IP is marked for release and goes through the handshake process to "released" state
  2. The IP is removed from spec.ipam.pool as part of the release process
  3. Before the agent can complete the handshake by removing the IP from status.ipam.ReleaseIPs, the same IP is reassigned to the ENI (e.g., due to PreAllocate or new pods needing IP addresses)
  4. The IP is added back to spec.ipam.pool
  5. When the agent tries to clean up, its logic checks the following condition:

    cilium/pkg/ipam/crd.go

    Lines 450 to 454 in 79e9a78

    if _, ok := n.ownNode.Spec.IPAM.Pool[ip]; !ok {
        if status == ipamOption.IPAMReleased {
            // Remove entry from release-ips only when it is removed from .spec.ipam.pool as well
            delete(n.ownNode.Status.IPAM.ReleaseIPs, ip)
            releaseUpstreamSyncNeeded = true

    Since the IP is now in the pool, the agent doesn't remove it from status.ipam.ReleaseIPs.
  6. The IP remains in both spec.ipam.pool AND status.ipam.ReleaseIPs with status "released"
  7. When checking for available IPs, isIPInReleaseHandshake() returns true for this IP, making it unavailable for allocation
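The steps above can be replayed with simplified maps standing in for the CiliumNode spec/status (a hypothetical model, not Cilium code; the cleanup condition mirrors the crd.go excerpt quoted in step 5):

```go
package main

import "fmt"

// simulateRace replays steps 1-7 and reports whether the IP ends up in
// both spec.ipam.pool and status.ipam.release-ips.
func simulateRace() (inPool, inRelease bool) {
	pool := map[string]struct{}{}     // spec.ipam.pool
	releaseIPs := map[string]string{} // status.ipam.release-ips
	ip := "10.0.0.5"

	// Steps 1-2: the handshake completes; the IP is marked "released"
	// and removed from the pool.
	releaseIPs[ip] = "released"
	delete(pool, ip)

	// Steps 3-4: before cleanup runs, the IP is reassigned to the ENI
	// and re-added to the pool.
	pool[ip] = struct{}{}

	// Step 5: cleanup only drops the release-ips entry when the IP is
	// absent from the pool, so nothing is deleted here.
	if _, ok := pool[ip]; !ok {
		if releaseIPs[ip] == "released" {
			delete(releaseIPs, ip)
		}
	}

	// Steps 6-7: the entry survives in both places.
	_, inPool = pool[ip]
	_, inRelease = releaseIPs[ip]
	return inPool, inRelease
}

func main() {
	inPool, inRelease := simulateRace()
	fmt.Println(inPool, inRelease) // true true: the stuck state
}
```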

Restarting the Cilium operator temporarily resolves the issue as it resets the internal tracking state.
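As a diagnostic, one could scan the CiliumNode object for IPs in this stuck state. This is a hedged sketch, assuming the CRD field paths described above (spec.ipam.pool and status.ipam.release-ips) and JSON input such as the output of `kubectl get ciliumnode <node> -o json`:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ciliumNode deserializes only the fields relevant to this bug report.
type ciliumNode struct {
	Spec struct {
		IPAM struct {
			Pool map[string]json.RawMessage `json:"pool"`
		} `json:"ipam"`
	} `json:"spec"`
	Status struct {
		IPAM struct {
			ReleaseIPs map[string]string `json:"release-ips"`
		} `json:"ipam"`
	} `json:"status"`
}

// stuckIPs returns IPs that are simultaneously in spec.ipam.pool and in
// status.ipam.release-ips with status "released".
func stuckIPs(data []byte) ([]string, error) {
	var cn ciliumNode
	if err := json.Unmarshal(data, &cn); err != nil {
		return nil, err
	}
	var stuck []string
	for ip, status := range cn.Status.IPAM.ReleaseIPs {
		if _, inPool := cn.Spec.IPAM.Pool[ip]; inPool && status == "released" {
			stuck = append(stuck, ip)
		}
	}
	return stuck, nil
}

func main() {
	sample := []byte(`{
		"spec":   {"ipam": {"pool": {"10.0.0.5": {}}}},
		"status": {"ipam": {"release-ips": {"10.0.0.5": "released"}}}
	}`)
	ips, err := stuckIPs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(ips) // [10.0.0.5]
}
```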

How can we reproduce the issue?

Because this issue requires tricky timing, it is not easy to reproduce, but we have seen it a few times in a few high-churn clusters.

Cilium Version

1.17.2

Kernel Version

6.8.0-1029-aws #31~22.04.1-Ubuntu

Kubernetes Version

1.31

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

    Labels

    area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
    area/ipam: IP address management, including cloud IPAM.
    kind/bug: This is a bug in the Cilium logic.
    kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
    needs/triage: This issue requires triaging to establish severity and next steps.
