/kind bug
1. What kops version are you running? The command kops version will display this information.
1.31.0-alpha.1
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
Upgrading from 1.30.5 to 1.31.1.
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Update the cluster's kubernetesVersion and then run:
kops update cluster
kops rolling-update cluster
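For reference, the version bump is only a change to kubernetesVersion in the cluster spec; a minimal sketch, with my.example.com as a placeholder cluster name:

```yaml
# Fragment of the kops cluster spec (edited via kops edit cluster my.example.com).
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: my.example.com
spec:
  kubernetesVersion: 1.31.1  # bumped from 1.30.5
```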
5. What happened after the commands executed?
The rolling-update got stuck in a validation loop and eventually timed out, because pods on the new worker nodes created by Karpenter after kops update cluster failed to start, as described in kubernetes/kubernetes#127316.
6. What did you expect to happen?
It would have been great if the rolling update had completed without errors.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
The only relevant part here is having Karpenter enabled and then upgrading the Kubernetes version to 1.31.1.
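For reference, a minimal sketch of the relevant manifest fragments, assuming the standard kops Karpenter integration (names are placeholders):

```yaml
# Cluster spec: Karpenter enabled.
spec:
  karpenter:
    enabled: true
---
# InstanceGroup spec: worker nodes managed by Karpenter.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
spec:
  manager: Karpenter
  role: Node
```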
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
The rolling-update validation loop outputs lines like this over and over:
I1016 03:06:22.203255 2989 instancegroups.go:567] Cluster did not pass validation, will retry in "30s": node "i-05f95c0b6ad6e5201" of role "node" is not ready, system-node-critical pod "calico-node-ct25f" is pending, system-node-critical pod "ebs-csi-node-sm8v6" is pending, system-node-critical pod "efs-csi-node-bmdvv" is pending.
Upon describing one of those pods:
Warning Failed 25m (x12 over 27m) kubelet Error: services have not yet been read at least once, cannot construct envvars
9. Anything else we need to know?
It should be possible to work around this issue by pausing autoscaling before kops update cluster until after kops rolling-update cluster has replaced all of the control plane nodes, or with judicious use of kops rolling-update cluster --cloudonly.
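A rough sketch of the second approach, assuming placeholder names and the standard flags (untested here; --cloudonly skips validation, so check the cluster manually afterwards):

```sh
# Apply the updated spec, then roll the nodes without waiting on validation.
kops update cluster --name my.example.com --yes
kops rolling-update cluster --name my.example.com --cloudonly --yes
# Check cluster health manually once the control plane has been replaced.
kops validate cluster --name my.example.com --wait 10m
```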