Rolling a cluster from Kubernetes 1.30 to 1.31 gets stuck in a validation loop when new nodes are added to the cluster via CAS/Karpenter after `kops update cluster` completes

/kind bug

**1. What `kops` version are you running? The command `kops version`, will display
 this information.**
1.31.0-alpha.1

**2. What Kubernetes version are you running? `kubectl version` will print the
 version if a cluster is running or provide the Kubernetes version specified as
 a `kops` flag.**
Upgrading from 1.30.5 to 1.31.1.

**3. What cloud provider are you using?**
AWS

**4. What commands did you run?  What is the simplest way to reproduce this issue?**
Update the cluster `kubernetesVersion` and then run:
`kops update cluster`
`kops rolling-update cluster`

**5. What happened after the commands executed?**
The rolling update got stuck in a validation loop and eventually timed out. Karpenter created new worker nodes after `kops update cluster` completed, and pods on those nodes failed to start, as described in https://github.com/kubernetes/kubernetes/issues/127316.

**6. What did you expect to happen?**
The rolling update should complete without errors.

**7. Please provide your cluster manifest. Execute
  `kops get --name my.example.com -o yaml` to display your cluster manifest.
  You may want to remove your cluster name and other sensitive information.**
The only relevant part is that Karpenter is enabled and the Kubernetes version is being upgraded to 1.31.1.

**8. Please run the commands with most verbose logging by adding the `-v 10` flag.
  Paste the logs into this report, or in a gist and provide the gist link here.**
The rolling-update validation loop repeatedly prints messages like this:
```
I1016 03:06:22.203255    2989 instancegroups.go:567] Cluster did not pass validation, will retry in "30s": node "i-05f95c0b6ad6e5201" of role "node" is not ready, system-node-critical pod "calico-node-ct25f" is pending, system-node-critical pod "ebs-csi-node-sm8v6" is pending, system-node-critical pod "efs-csi-node-bmdvv" is pending.
```
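To follow up on a message like the one above, it can help to pull out the names of the pending pods so they can be passed to `kubectl describe pod`. The following is a minimal sketch, not part of kops: the helper name `pending_pods` is hypothetical, and it assumes the validation message keeps the `pod "<name>" is pending` wording shown in the log line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kops validation output on stdin and print the
# unique names of pods reported as pending, one per line.
pending_pods() {
  grep -o 'pod "[^"]*" is pending' \
    | sed -e 's/^pod "//' -e 's/" is pending$//' \
    | sort -u
}

# Demo using the message from this report (log prefix omitted).
sample='node "i-05f95c0b6ad6e5201" of role "node" is not ready, system-node-critical pod "calico-node-ct25f" is pending, system-node-critical pod "ebs-csi-node-sm8v6" is pending, system-node-critical pod "efs-csi-node-bmdvv" is pending.'
printf '%s\n' "$sample" | pending_pods
```

Each printed name can then be described in whichever namespace the pod runs in.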

Describing one of those pods shows:
```
  Warning  Failed     25m (x12 over 27m)     kubelet            Error: services have not yet been read at least once, cannot construct envvars
```

**9. Anything else we need to know?**
It should be possible to work around this issue by pausing autoscaling from before `kops update cluster` until `kops rolling-update cluster` has replaced all of the control plane nodes, or with judicious use of `kops rolling-update cluster --cloudonly`.
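The first workaround could be sketched roughly as follows. This is untested against a real cluster and rests on several assumptions that are not in the report: Karpenter runs as a deployment named `karpenter` in `kube-system`, one replica is its normal state, scaling it to zero is an acceptable way to pause it (an addon update may scale it back up), and the cluster name is supplied through `KOPS_CLUSTER_NAME`. The `run` wrapper and the `DRY_RUN` variable are illustrative; by default the script only prints the commands. Cluster Autoscaler would need an equivalent pause step.

```shell
#!/usr/bin/env bash
# Sketch: keep Karpenter from launching nodes while the control plane is
# still being replaced. Deployment name, namespace and replica count below
# are assumptions; adjust them for your cluster.
set -eu

KARPENTER_NS="${KARPENTER_NS:-kube-system}"
KARPENTER_DEPLOY="${KARPENTER_DEPLOY:-karpenter}"
KARPENTER_REPLICAS="${KARPENTER_REPLICAS:-1}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

upgrade_with_paused_autoscaling() {
  # 1. Pause Karpenter so no new nodes join during the control plane roll.
  run kubectl -n "$KARPENTER_NS" scale deployment "$KARPENTER_DEPLOY" --replicas=0
  # 2. Apply the new kubernetesVersion (--yes is needed to actually apply).
  run kops update cluster --yes
  # 3. Replace only the control plane nodes first.
  run kops rolling-update cluster --instance-group-roles control-plane --yes
  # 4. Resume Karpenter once the control plane runs the new version.
  run kubectl -n "$KARPENTER_NS" scale deployment "$KARPENTER_DEPLOY" --replicas="$KARPENTER_REPLICAS"
  # 5. Roll the remaining instance groups.
  run kops rolling-update cluster --yes
}

upgrade_with_paused_autoscaling
```

Run it once with the default `DRY_RUN=1` to review the printed commands, then with `DRY_RUN=0` to execute them.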

Rolling a cluster from Kubernetes 1.30 to 1.31 gets stuck in a validation loop when new nodes are added to the cluster via CAS/Karpenter after `kops update cluster` completes #16907
