
kube-controller-manager re-allocating already used PodCIDR (v1.31) #127792

@mikkeloscar

Description


What happened?

Since upgrading to Kubernetes v1.31 we have occasionally observed that kube-controller-manager allocates a PodCIDR for a new node that is already allocated to an existing node in the cluster. We have about 200 clusters with a combined total of roughly 6k-10k nodes (autoscaling). It has happened about 10 times since upgrading a week ago, so it is rather rare, but it breaks the networking between the affected nodes when it happens.

We use flannel as the overlay network, and it uses the podCIDR assigned by kube-controller-manager. When a new node is assigned an already-used podCIDR, the new node never becomes ready and is terminated after 10 minutes: before marking a node ready we check that it is routable from all other nodes, and that check can never pass because the podCIDR already points to another node in the cluster. Deleting the new node has the very unfortunate side effect that every flannel instance sees the node deletion and removes the routes for that podCIDR; but most nodes hold those routes for the original node with the same podCIDR, so its networking breaks as well.

Terminating the affected node is the best way to resolve the issue, but that is obviously not sustainable.
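
As a stopgap we can at least detect the condition: listing all nodes and grouping them by spec.podCIDR immediately reveals a duplicate assignment. A minimal sketch, assuming client-go and a local kubeconfig (illustrative only, not part of our setup):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Illustrative: load ~/.kube/config; in-cluster config would work the same way.
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }

    // Group nodes by their assigned podCIDR; more than one node per CIDR is the bug.
    byCIDR := map[string][]string{}
    for _, n := range nodes.Items {
        if n.Spec.PodCIDR != "" {
            byCIDR[n.Spec.PodCIDR] = append(byCIDR[n.Spec.PodCIDR], n.Name)
        }
    }
    for cidr, names := range byCIDR {
        if len(names) > 1 {
            fmt.Printf("duplicate podCIDR %s assigned to nodes %v\n", cidr, names)
        }
    }
}

Running a check like this shortly after node registration would let us catch the duplicate before the 10-minute termination (and the resulting route deletion) kicks in.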

What did you expect to happen?

kube-controller-manager should never allocate the same podCIDR to multiple nodes at the same time.
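
To state the expected invariant concretely: a CIDR that the allocator has marked as occupied (for example while resyncing existing nodes that already carry a spec.podCIDR) must never be returned by a later allocation until it is released on node deletion. A simplified, self-contained sketch of that bookkeeping (not the actual RangeAllocator code):

package main

import (
    "errors"
    "fmt"
)

// cidrSet is a toy stand-in for the allocator's bookkeeping: one slot per
// candidate /24 inside the cluster CIDR.
type cidrSet struct {
    used []bool
    next int // round-robin cursor
}

// Occupy marks a subnet as taken, e.g. while resyncing nodes that already
// have a spec.podCIDR set.
func (s *cidrSet) Occupy(i int) { s.used[i] = true }

// Release frees a subnet once its node has been deleted.
func (s *cidrSet) Release(i int) { s.used[i] = false }

// AllocateNext hands out a free subnet and must never return an occupied one.
func (s *cidrSet) AllocateNext() (int, error) {
    for range s.used {
        i := s.next
        s.next = (s.next + 1) % len(s.used)
        if !s.used[i] {
            s.used[i] = true
            return i, nil
        }
    }
    return 0, errors.New("CIDR range exhausted")
}

func main() {
    s := &cidrSet{used: make([]bool, 512)} // 10.2.0.0/15 split into /24s
    s.Occupy(4)                            // 10.2.4.0/24 belongs to the original node
    i, _ := s.AllocateNext()
    fmt.Printf("allocated subnet index %d (must never be 4 while it is occupied)\n", i)
}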

How can we reproduce it (as minimally and precisely as possible)?

We haven't found a predictable way to reproduce it.

We run with the pod CIDR range: 10.2.0.0/15.
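
With the /24 per-node masks visible in the logs below, that range yields 2^(24-15) = 512 possible node podCIDRs per cluster; a quick check:

package main

import "fmt"

func main() {
    clusterMaskBits, nodeMaskBits := 15, 24 // 10.2.0.0/15 cluster CIDR, /24 per node
    n := 1 << (nodeMaskBits - clusterMaskBits)
    fmt.Println(n) // 512 possible node podCIDRs
}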

Anything else we need to know?

This is how we configure kube-controller-manager: https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/node-pools/master-default/userdata.yaml#L581-L648 (the image is vanilla Kubernetes built from source and hosted in a private registry)
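
The node-IPAM-related flags boil down to roughly the following (a paraphrase using the values from this report and Kubernetes defaults; the authoritative flag set is in the linked userdata):

kube-controller-manager \
  --allocate-node-cidrs=true \
  --cluster-cidr=10.2.0.0/15 \
  --node-cidr-mask-size=24 \
  --cidr-allocator-type=RangeAllocator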

Here is an example from one cluster with logs:

Node 1 (original):
name: ip-172-31-14-210.eu-central-1.compute.internal
podCIDR: 10.2.4.0/24

Node 2 (new):
name: ip-172-31-15-241.eu-central-1.compute.internal
podCIDR: 10.2.4.0/24

kube-controller-manager logs filtered for the node ip-172-31-14-210.eu-central-1.compute.internal:

I1001 16:54:22.944453       1 actual_state_of_world.go:540] "Failed to update statusUpdateNeeded field in actual state of world" logger="persistentvolume-attach-detach-controller" err="Failed to set statusUpdateNeeded to needed true, because nodeName=\"ip-172-31-14-210.eu-central-1.compute.internal\" does not exist"
I1001 16:54:23.001587       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 16:54:23.054817       1 node_lifecycle_controller.go:884] "Missing timestamp for Node. Assuming now as a timestamp" logger="node-lifecycle-controller" node="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 16:56:52.839072       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:01:57.639919       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:03:00.584960       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:03:19.873337       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:08:26.273278       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:13:32.137435       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:18:37.436066       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:20:50.207426       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:06.023052       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:06.058496       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:11.270715       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:11.376488       1 reconciler.go:275] "attacherDetacher.DetachVolume started" logger="persistentvolume-attach-detach-controller" node="ip-172-31-14-210.eu-central-1.compute.internal" volumeName="kubernetes.io/csi/ebs.csi.aws.com^vol-01a06d931a5cbc177"
I1001 17:22:11.386310       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:11.392521       1 operation_generator.go:1539] Verified volume is safe to detach for volume "pvc-0ecc2568-e98d-430d-aa57-d545f7cb8c85" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-01a06d931a5cbc177") on node "ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:22:18.918976       1 operation_generator.go:437] DetachVolume.Detach succeeded for volume "pvc-0ecc2568-e98d-430d-aa57-d545f7cb8c85" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-01a06d931a5cbc177") on node "ip-172-31-14-210.eu-central-1.compute.internal"
I1001 17:23:15.637313       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"

kube-controller-manager logs filtered for the node ip-172-31-15-241.eu-central-1.compute.internal:

I1001 17:22:45.543797       1 actual_state_of_world.go:540] "Failed to update statusUpdateNeeded field in actual state of world" logger="persistentvolume-attach-detach-controller" err="Failed to set statusUpdateNeeded to needed true, because nodeName=\"ip-172-31-15-241.eu-central-1.compute.internal\" does not exist"
I1001 17:22:45.647668       1 range_allocator.go:422] "Set node PodCIDR" logger="node-ipam-controller" node="ip-172-31-15-241.eu-central-1.compute.internal" podCIDRs=["10.2.4.0/24"]
I1001 17:22:45.648019       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.648088       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.695174       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.702231       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.724468       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.752257       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.820108       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:45.846131       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:22:48.640513       1 node_lifecycle_controller.go:884] "Missing timestamp for Node. Assuming now as a timestamp" logger="node-lifecycle-controller" node="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:02.216555       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:02.273556       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:16.215529       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:21.763576       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:23:46.683733       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:25:05.756696       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"
I1001 17:25:05.786234       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-15-241.eu-central-1.compute.internal"

The interesting lines are:

I1001 16:54:23.001587       1 range_allocator.go:241] "Successfully synced" logger="node-ipam-controller" key="ip-172-31-14-210.eu-central-1.compute.internal"
...
I1001 17:22:45.647668       1 range_allocator.go:422] "Set node PodCIDR" logger="node-ipam-controller" node="ip-172-31-15-241.eu-central-1.compute.internal" podCIDRs=["10.2.4.0/24"]

At 16:54 the original node is successfully synced by the node-ipam-controller, and only at 17:22, much later, is the new node assigned the same podCIDR that is already allocated to the original node.

We can also see this from the flannel logs on the new node:

 time="2024-10-01T17:22:58Z" level=info msg="Added new node ip-172-31-14-210.eu-central-1.compute.internal with PodCIDR 10.2.4.0/24, address 10.2.4.0"

Kubernetes version

$ kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.31.0

Cloud provider

AWS running a custom Kubernetes setup on EC2

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Metadata

Labels

kind/bug, kind/regression, sig/api-machinery, sig/network, triage/accepted
