Skip to content

Conversation

sayboras
Copy link
Member

@sayboras sayboras commented Nov 5, 2024

Description

In some cases, there might be race condition between Instance Manager
(running in operator) and IPAM (running in agent) modules, which could
lead to IP details are not populated properly, and hence cause the below
fatal error. This commit is to make sure that we don't proceed if the
required CIDRs are not populated.

time="2024-10-22T22:48:31Z" level=fatal msg="failed to start: daemon creation failed: failed to coalesce CIDRs: invalid CIDR address: " subsys=daemon

Signed-off-by: Tam Mach tam.mach@cilium.io


Testing

Testing was done locally by rapidly scaling up the cluster (to increase the likelihood of race condition), no agent restart was observed.

$ ksyslo $cilium_pod | grep "ENI state is not consistent yet"
Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), wait-for-node-init (init), clean-cilium-state (init), wait-for-kube-proxy (init), install-cni-binaries (init)
time="2024-11-05T14:26:13Z" level=warning msg="ENI state is not consistent yet" error="VPC Primary CIDR not set for ENI eni-08c59e332a31361c2" subsys=ipam
time="2024-11-05T14:26:13Z" level=warning msg="ENI state is not consistent yet" error="VPC Primary CIDR not set for ENI eni-08c59e332a31361c2" subsys=ipam
time="2024-11-05T14:27:49Z" level=warning msg="ENI state is not consistent yet" error="VPC Primary CIDR not set for ENI eni-0980fa15122c943cf" subsys=ipam
time="2024-11-05T14:27:53Z" level=warning msg="ENI state is not consistent yet" error="VPC Primary CIDR not set for ENI eni-0980fa15122c943cf" subsys=ipam
Scaling up the cluster
$ ksysgpo -l app.kubernetes.io/name=cilium-agent -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP           NODE                                            NOMINATED NODE   READINESS GATES
cilium-4bws7   1/1     Running   0          62m   10.1.1.96    ip-10-1-1-96.ap-southeast-2.compute.internal    <none>           <none>
cilium-4gsj4   1/1     Running   0          39m   10.1.2.164   ip-10-1-2-164.ap-southeast-2.compute.internal   <none>           <none>
cilium-66dv4   1/1     Running   0          57m   10.1.2.193   ip-10-1-2-193.ap-southeast-2.compute.internal   <none>           <none>
cilium-6zlgz   1/1     Running   0          39m   10.1.1.191   ip-10-1-1-191.ap-southeast-2.compute.internal   <none>           <none>
cilium-7ppc9   1/1     Running   0          39m   10.1.0.156   ip-10-1-0-156.ap-southeast-2.compute.internal   <none>           <none>
cilium-7xj6z   1/1     Running   0          62m   10.1.0.126   ip-10-1-0-126.ap-southeast-2.compute.internal   <none>           <none>
cilium-889mh   1/1     Running   0          57m   10.1.2.90    ip-10-1-2-90.ap-southeast-2.compute.internal    <none>           <none>
cilium-8ntrx   1/1     Running   0          39m   10.1.2.145   ip-10-1-2-145.ap-southeast-2.compute.internal   <none>           <none>
cilium-97jxz   1/1     Running   0          62m   10.1.0.237   ip-10-1-0-237.ap-southeast-2.compute.internal   <none>           <none>
cilium-9pf5k   1/1     Running   0          39m   10.1.1.199   ip-10-1-1-199.ap-southeast-2.compute.internal   <none>           <none>
cilium-b2c2b   1/1     Running   0          39m   10.1.1.146   ip-10-1-1-146.ap-southeast-2.compute.internal   <none>           <none>
cilium-bssjx   1/1     Running   0          39m   10.1.1.175   ip-10-1-1-175.ap-southeast-2.compute.internal   <none>           <none>
cilium-c4h4j   1/1     Running   0          39m   10.1.0.111   ip-10-1-0-111.ap-southeast-2.compute.internal   <none>           <none>
cilium-cqmd6   1/1     Running   0          39m   10.1.2.16    ip-10-1-2-16.ap-southeast-2.compute.internal    <none>           <none>
cilium-dfphb   1/1     Running   0          39m   10.1.1.147   ip-10-1-1-147.ap-southeast-2.compute.internal   <none>           <none>
cilium-fc5dc   1/1     Running   0          39m   10.1.0.166   ip-10-1-0-166.ap-southeast-2.compute.internal   <none>           <none>
cilium-fg86w   1/1     Running   0          62m   10.1.1.17    ip-10-1-1-17.ap-southeast-2.compute.internal    <none>           <none>
cilium-fqm7v   1/1     Running   0          62m   10.1.1.4     ip-10-1-1-4.ap-southeast-2.compute.internal     <none>           <none>
cilium-gn589   1/1     Running   0          62m   10.1.0.131   ip-10-1-0-131.ap-southeast-2.compute.internal   <none>           <none>
cilium-h5jpr   1/1     Running   0          62m   10.1.2.9     ip-10-1-2-9.ap-southeast-2.compute.internal     <none>           <none>
cilium-hklxg   1/1     Running   0          57m   10.1.1.162   ip-10-1-1-162.ap-southeast-2.compute.internal   <none>           <none>
cilium-j49t9   1/1     Running   0          62m   10.1.2.38    ip-10-1-2-38.ap-southeast-2.compute.internal    <none>           <none>
cilium-j9msw   1/1     Running   0          57m   10.1.2.41    ip-10-1-2-41.ap-southeast-2.compute.internal    <none>           <none>
cilium-kkj2k   1/1     Running   0          62m   10.1.2.252   ip-10-1-2-252.ap-southeast-2.compute.internal   <none>           <none>
cilium-mclb2   1/1     Running   0          57m   10.1.1.176   ip-10-1-1-176.ap-southeast-2.compute.internal   <none>           <none>
cilium-mcs5c   1/1     Running   0          57m   10.1.0.19    ip-10-1-0-19.ap-southeast-2.compute.internal    <none>           <none>
cilium-nbbjs   1/1     Running   0          39m   10.1.2.76    ip-10-1-2-76.ap-southeast-2.compute.internal    <none>           <none>
cilium-nkh6r   1/1     Running   0          62m   10.1.1.61    ip-10-1-1-61.ap-southeast-2.compute.internal    <none>           <none>
cilium-pdcfn   1/1     Running   0          57m   10.1.2.47    ip-10-1-2-47.ap-southeast-2.compute.internal    <none>           <none>
cilium-q486w   1/1     Running   0          39m   10.1.0.42    ip-10-1-0-42.ap-southeast-2.compute.internal    <none>           <none>
cilium-q5h5p   1/1     Running   0          39m   10.1.1.204   ip-10-1-1-204.ap-southeast-2.compute.internal   <none>           <none>
cilium-sqfsb   1/1     Running   0          39m   10.1.1.131   ip-10-1-1-131.ap-southeast-2.compute.internal   <none>           <none>
cilium-tbd9v   1/1     Running   0          39m   10.1.0.5     ip-10-1-0-5.ap-southeast-2.compute.internal     <none>           <none>
cilium-tz867   1/1     Running   0          39m   10.1.0.64    ip-10-1-0-64.ap-southeast-2.compute.internal    <none>           <none>
cilium-vggbh   1/1     Running   0          39m   10.1.0.48    ip-10-1-0-48.ap-southeast-2.compute.internal    <none>           <none>
cilium-x24qh   1/1     Running   0          39m   10.1.2.190   ip-10-1-2-190.ap-southeast-2.compute.internal   <none>           <none>
cilium-x2jsk   1/1     Running   0          57m   10.1.0.244   ip-10-1-0-244.ap-southeast-2.compute.internal   <none>           <none>
cilium-xsxx4   1/1     Running   0          57m   10.1.1.158   ip-10-1-1-158.ap-southeast-2.compute.internal   <none>           <none>
cilium-z4qpz   1/1     Running   0          39m   10.1.2.215   ip-10-1-2-215.ap-southeast-2.compute.internal   <none>           <none>

@maintainer-s-little-helper
Copy link

Commit 8d2fbf2 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 5, 2024
@sayboras sayboras requested a review from gandro November 5, 2024 14:55
This reverts commit 12bbed5.

Relates: #35695 (review)
Signed-off-by: Tam Mach <tam.mach@cilium.io>
@sayboras sayboras force-pushed the pr/tammach/eni-empty-cidr-take-2 branch from c75508e to 14bdbf7 Compare November 5, 2024 14:56
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 5, 2024
@sayboras sayboras added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.15 needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Nov 5, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 5, 2024
@sayboras sayboras force-pushed the pr/tammach/eni-empty-cidr-take-2 branch from 14bdbf7 to daac74f Compare November 5, 2024 15:14
@sayboras sayboras marked this pull request as ready for review November 5, 2024 15:14
@sayboras sayboras requested a review from a team as a code owner November 5, 2024 15:14
@sayboras sayboras added the backport/author The backport will be carried out by the author of the PR. label Nov 5, 2024
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for tackling this. One minor non-blocking nit

In some cases, there might be race condition between Instance Manager
(running in operator) and IPAM (running in agent) modules, which could
lead to IP details are not populated properly, and hence cause the below
fatal error. This commit is to make sure that we don't proceed if the
required CIDRs are not populated.

```
time="2024-10-22T22:48:31Z" level=fatal msg="failed to start: daemon creation failed: failed to coalesce CIDRs: invalid CIDR address: " subsys=daemon
```
Suggested-by: Sebastian Wicki <gandro@gmx.net>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
@sayboras
Copy link
Member Author

sayboras commented Nov 5, 2024

/test

@sayboras sayboras force-pushed the pr/tammach/eni-empty-cidr-take-2 branch from daac74f to 40fa07b Compare November 5, 2024 22:05
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 5, 2024
@sayboras sayboras added this pull request to the merge queue Nov 5, 2024
Merged via the queue into main with commit 0a5054a Nov 5, 2024
273 checks passed
@sayboras sayboras deleted the pr/tammach/eni-empty-cidr-take-2 branch November 5, 2024 23:40
@sayboras sayboras mentioned this pull request Nov 5, 2024
1 task
@sayboras sayboras added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Nov 5, 2024
@sayboras sayboras mentioned this pull request Nov 5, 2024
1 task
@sayboras sayboras added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 labels Nov 5, 2024
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Nov 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/author The backport will be carried out by the author of the PR. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants