Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.17.4 and lower than v1.18.0
What happened?
We've identified a bug in Cilium's IPAM with AWS ENI where an IP address that was marked for release and then reassigned back to the ENI can become stuck in a state where it is both in the CiliumNode's `spec.ipam.pool` and in `status.ipam.release-IPs` with status "released". This prevents the IP from being assigned to new pods, leading to pods stuck in the ContainerCreating state with "No more IPs available" errors.
Environment
Cilium version: 1.17.2
Kubernetes: EKS 1.31
IPAM mode: AWS ENI
Here is where I believe the issue occurs in the IP release handshake process, when the following sequence happens:

- An IP is marked for release and goes through the handshake process to the "released" state.
- The IP is removed from `spec.ipam.pool` as part of the release process.
- Before the agent can complete the handshake by removing the IP from `status.ipam.ReleaseIPs`, the same IP is reassigned to the ENI (e.g., due to `PreAllocate` or new pods needing IP addresses), and the IP is added back to `spec.ipam.pool`.
- When the agent tries to clean up, its logic checks (lines 450 to 454 in 79e9a78):

  ```go
  if _, ok := n.ownNode.Spec.IPAM.Pool[ip]; !ok {
      if status == ipamOption.IPAMReleased {
          // Remove entry from release-ips only when it is removed from .spec.ipam.pool as well
          delete(n.ownNode.Status.IPAM.ReleaseIPs, ip)
          releaseUpstreamSyncNeeded = true
  ```

  Since the IP is now back in the pool, the agent doesn't remove it from `status.ipam.ReleaseIPs`.
- The IP remains in both `spec.ipam.pool` AND `status.ipam.ReleaseIPs` with status "released".
- When checking for available IPs, `isIPInReleaseHandshake()` returns true for this IP, making it unavailable for allocation.
Restarting the Cilium operator temporarily resolves the issue as it resets the internal tracking state.
How can we reproduce the issue?
Because this issue requires tricky timing, it is not easy to reproduce, but we've seen it a few times in high-churn clusters.
Cilium Version
1.17.2
Kernel Version
6.8.0-1029-aws #31~22.04.1-Ubuntu
Kubernetes Version
1.31
Regression
No response
Sysdump
No response
Relevant log output
Anything else?
No response
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct