-
Notifications
You must be signed in to change notification settings - Fork 3.4k
ipam: Avoid empty CIDR in ENI mode #35695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7228b02
to
322c922
Compare
In some cases, there might be race condition between Instance Manager (running in operator) and IPAM (running in agent) components, this could lead to the issue, in which IP details are not populated properly, and hence cause the below fatal error. This commit is to make sure that we don't add empty CIDR into the allocation result. Just a note that once the CRD is updated, the existing resync process will kick off and perform the needful. ``` time="2024-10-22T22:48:31Z" level=fatal msg="failed to start: daemon creation failed: failed to coalesce CIDRs: invalid CIDR address: " subsys=daemon ``` Signed-off-by: Tam Mach <tam.mach@cilium.io>
322c922
to
fd1450d
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only noticed this now and I'm a bit worried that this papers over a more serious issue. Generally speaking, the agent must not hand out IPs if the ENI allocator is not ready. In particular, if we have a race where the agent somehow observes incomplete ENI information, then ignoring that missing information and handing out IPs is making things worse:
If we hand out ENI IPs to endpoints without e.g. the the VPC CIDR, then the IP rules created for that endpoint (here) will be incomplete and there is currently no way that will be reconciled later. That endpoint that got the incomplete IPAM result will remain broken forever.
In that sense, having the agent crash loop until the CiliumNode object fully populated prevents us from using incomplete wrong information. In other words, the crash was preventing us from doing even worse things. Ideally of course we would just block instead of crash, but continuing with incomplete information will lead to even worse issues I'm afraid.
This reverts commit 12bbed5. Relates: #35695 (review)
This reverts commit 12bbed5. Relates: #35695 (review) Signed-off-by: Tam Mach <tam.mach@cilium.io>
This reverts commit 12bbed5. Relates: #35695 (review) Signed-off-by: Tam Mach <tam.mach@cilium.io>
In some cases, there might be race condition between Instance Manager (running in operator) and IPAM (running in agent) modules, which could lead to IP details are not populated properly, and hence cause the below fatal error. This commit is to make sure that we don't add empty CIDR into the allocation result. Just a note that once the CRD is update, the existing resync process will kick off and perform the needful.
Relates: #32855