What happened?
Since upgrading to Kubernetes v1.31 we have occasionally observed that kube-controller-manager allocates a PodCIDR to a new node that is already allocated to an existing node in the cluster. We run about 200 clusters with a combined total of roughly 6k-10k nodes (autoscaling). It has happened about 10 times since upgrading a week ago, so it is rather rare, but it breaks the networking between the affected nodes when it does happen.
We use flannel as the overlay network, and flannel relies on the podCIDR assigned by kube-controller-manager. When a new node is assigned an already-used podCIDR, the new node never becomes ready and is terminated after 10 minutes: we check that the node is routable from all other nodes before marking it ready, and that check can never pass because the podCIDR already points to another node in the cluster. Deleting the new node has the very unfortunate side effect that every flannel daemon sees the node deletion and removes the routes for its podCIDR, but most nodes still hold routes for the original node with the same podCIDR, so networking to that node then breaks as well.
Terminating the affected node is the best way to resolve the issue but obviously not sustainable.
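For anyone else hitting this symptom: one quick way to check a cluster for the collision (a possible invocation, not part of our tooling) is to list every node's podCIDR and print any value that occurs more than once:
$ kubectl get nodes -o custom-columns='NODE:.metadata.name,PODCIDR:.spec.podCIDR' --no-headers \
    | awk '{print $2}' | sort | uniq -d
Any output from uniq -d is a podCIDR that has been handed out to more than one node.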
What did you expect to happen?
kube-controller-manager should never allocate the same podCIDR to multiple nodes at the same time.
How can we reproduce it (as minimally and precisely as possible)?
We haven't found a predictable way to reproduce it.
We run with the pod CIDR range 10.2.0.0/15.
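For context, a bit of arithmetic on the address space (just the math, not cluster output): a /15 cluster CIDR split into /24 per-node ranges gives 2^(24-15) = 512 possible allocations, so a single cluster of our size should be nowhere near exhausting the range:
$ echo $(( 1 << (24 - 15) ))    # number of /24 podCIDRs available in a /15 cluster CIDR
512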
Anything else we need to know?
This is how we configure kube-controller-manager: https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/node-pools/master-default/userdata.yaml#L581-L648 (the image is vanilla Kubernetes built from source and hosted in a private registry)
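For readers who don't want to open the linked userdata: the node-IPAM-relevant part of that configuration boils down to roughly the following flags (values inferred from the cluster CIDR and the /24 podCIDRs above; the linked file is the source of truth):
$ kube-controller-manager \
    --allocate-node-cidrs=true \
    --cluster-cidr=10.2.0.0/15 \
    --node-cidr-mask-size=24    # default RangeAllocator, one /24 per node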
Here is an example from one cluster with logs:
Node 1 (original):
name: ip-172-31-14-210.eu-central-1.compute.internal
podCIDR: 10.2.4.0/24
Node 2 (new):
name: ip-172-31-15-241.eu-central-1.compute.internal
podCIDR: 10.2.4.0/24
kube-controller-manager logs filtered for the node ip-172-31-14-210.eu-central-1.compute.internal:
I1001 16:54:22.944453 1 actual_state_of_world.go:540] "Failed to update statusUpdateNeeded field in actual state of world" logger="persistentvolume-attach-detach-controller" err="Failed to set statusUpdateNeeded to needed true, because nodeName=\"ip-172-31-14-210.eu-central-1.compute.internal\" does not exist"
I1001 16:54:23.001587 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 16:54:23.054817 1 node_lifecycle_controller.go:884] "Missing timestamp for Node. Assuming now as a timestamp" logger="node-lifecycle-controller" node="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 16:56:52.839072 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:01:57.639919 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:03:00.584960 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:03:19.873337 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:08:26.273278 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:13:32.137435 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:18:37.436066 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:20:50.207426 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:06.023052 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:06.058496 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:11.270715 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:11.376488 1 reconciler.go:275] "attacherDetacher.DetachVolume started" logger="persistentvolume-attach-detach-controller" node="ip-172-31-14-210.eu-central-1.compute.internal" volumeName="kubernetes.io/csi/ebs.csi.aws.com^vol-01a06d931a5cbc177"
I1001 17:22:11.386310 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:11.392521 1 operation_generator.go:1539] Verified volume is safe to detach for volume "pvc-0ecc2568-e98d-430d-aa57-d545f7cb8c85" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-01a06d931a5cbc177") on node "ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:18.918976 1 operation_generator.go:437] DetachVolume.Detach succeeded for volume "pvc-0ecc2568-e98d-430d-aa57-d545f7cb8c85" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-01a06d931a5cbc177") on node "ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:23:15.637313 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
kube-controller-manager logs filtered for the node ip-172-31-15-241.eu-central-1.compute.internal:
I1001 17:22:45.543797 1 actual_state_of_world.go:540] "Failed to update statusUpdateNeeded field in actual state of world" logger="persistentvolume-attach-detach-controller" err="Failed to set statusUpdateNeeded to needed true, because nodeName=\"ip-172-31-15-241.eu-central-1.compute.internal\" does not exist"
I1001 17:22:45.647668 1 range_allocator.go:422] "Set node PodCIDR" logger="node-ipam-controller" node="ip-172-31-15-241.eu-central-1.compute.internal" podCIDRs=["10.2.4.0/24"]
I1001 17:22:45.648019 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.648088 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.695174 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.702231 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.724468 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.752257 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.820108 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.846131 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:48.640513 1 node_lifecycle_controller.go:884] "Missing timestamp for Node. Assuming now as a timestamp" logger="node-lifecycle-controller" node="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:02.216555 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:02.273556 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:16.215529 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:21.763576 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:46.683733 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:25:05.756696 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:25:05.786234 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
The interesting lines are:
I1001 16:54:23.001587 1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
...
I1001 17:22:45.647668 1 range_allocator.go:422] "Set node PodCIDR" logger="node-ipam-controller" node="ip-172-31-15-241.eu-central-1.compute.internal" podCIDRs=["10.2.4.0/24"]
At 16:54 the original node is "Successfully synced" by the node-ipam-controller, yet at 17:22, much later, the new node gets the same podCIDR allocated that is already assigned to the original node.
We can also see this from the flannel logs on the new node:
time="2024-10-01T17:22:58Z" level=info msg="Added new node ip-172-31-14-210.eu-central-1.compute.internal with PodCIDR 10.2.4.0/24, address 10.2.4.0"
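The collision is also visible directly on the node objects; filtering the podCIDR listing for the range from this incident would show both nodes claiming it:
$ kubectl get nodes -o custom-columns='NODE:.metadata.name,PODCIDR:.spec.podCIDR' | grep '10.2.4.0/24'
ip-172-31-14-210.eu-central-1.compute.internal   10.2.4.0/24
ip-172-31-15-241.eu-central-1.compute.internal   10.2.4.0/24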
Kubernetes version
$ kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.31.0
Cloud provider
AWS (EC2, eu-central-1)
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here