Description
Rancher Server Setup
- Rancher version: v2.7.2
- Installation option (Docker install/Helm Chart): N/A
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
- Proxy/Cert Details:
Information about the Cluster
- Kubernetes version: N/A
- Cluster Type (Local/Downstream): Downstream Custom K3s/RKE2
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): N/A
- If custom, define the set of permissions: N/A
Describe the bug
If a custom cluster is running K3s or RKE2, deleting a control plane + etcd node from the cluster can leave worker nodes NotReady indefinitely without manual intervention. This happens because the worker node kubelets retain a stale connection to the kube-apiserver on the deleted node and cannot function while that connection persists. The workaround is to properly clean up (or delete) the node's underlying infrastructure after deleting the node from the cluster.
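For illustration, a minimal sketch of how the symptom can be confirmed, assuming the official `kubernetes` Python client and a kubeconfig that still reaches a surviving kube-apiserver (both assumptions, not details from this report):

```python
# Diagnostic sketch: list worker nodes stuck in NotReady.
# Assumes the `kubernetes` Python client is installed and the kubeconfig
# points at a kube-apiserver that is still reachable.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    # RKE2/K3s label control plane nodes; skip them to look at workers only.
    labels = node.metadata.labels or {}
    if "node-role.kubernetes.io/control-plane" in labels:
        continue
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"),
        "Unknown",
    )
    if ready != "True":
        print(f"worker {node.metadata.name} is NotReady (Ready={ready})")
```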
To Reproduce
1. Provision a K3s or RKE2 cluster with multiple control plane + etcd nodes and multiple workers.
2. Delete a control plane + etcd node from the cluster, and observe that some of the workers may go NotReady and not self-recover.
Result
Some of the workers may go NotReady and not self-recover.
Expected Result
The worker nodes may temporarily go NotReady, but they should eventually recover.
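To tell a temporary blip apart from the stuck state described above, a simple polling check works; this is a sketch, and the 10-minute window is an arbitrary assumption rather than a number from this report:

```python
# Recovery check sketch: poll worker readiness and flag workers that stay
# NotReady past a deadline. DEADLINE_SECONDS is an arbitrary assumption.
import time

from kubernetes import client, config

DEADLINE_SECONDS = 600
POLL_SECONDS = 30


def not_ready_workers(v1):
    """Return names of worker nodes whose Ready condition is not True."""
    stuck = []
    for node in v1.list_node().items:
        labels = node.metadata.labels or {}
        if "node-role.kubernetes.io/control-plane" in labels:
            continue
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )
        if ready != "True":
            stuck.append(node.metadata.name)
    return stuck


config.load_kube_config()
v1 = client.CoreV1Api()
deadline = time.monotonic() + DEADLINE_SECONDS
stuck = not_ready_workers(v1)
while stuck and time.monotonic() < deadline:
    time.sleep(POLL_SECONDS)
    stuck = not_ready_workers(v1)

if stuck:
    print(f"workers still NotReady after {DEADLINE_SECONDS}s: {stuck}")
else:
    print("all workers recovered")
```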
Additional context
This will be fixed by #41011 on the Rancher side and rancher/rke2#4060 on the RKE2 side.
The release note should read:
When using custom clusters in Rancher v2.7.2 with RKE2 or K3s, deleting nodes without thoroughly cleaning up the underlying infrastructure can lead to unexpected behavior. When deleting a custom node from your cluster, ensure you delete the underlying infrastructure for it, or run the corresponding uninstall script for the Kubernetes distribution installed on the node. For RKE2, documentation can be found here: https://docs.rke2.io/install/uninstall?_highlight=uninstall#tarball-method and for K3s, documentation can be found here: https://docs.k3s.io/installation/uninstall
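For operators scripting this cleanup, a sketch of the recommended step, assuming SSH root access to the removed host; the script paths are the default install locations described in the linked uninstall docs, and the hostname is a placeholder:

```python
# Cleanup sketch: run the distribution's uninstall script on the removed
# node over SSH. Hostname and root access are assumptions for illustration;
# script paths follow the defaults in the linked RKE2/K3s uninstall docs.
import subprocess

NODE_HOST = "node-1.example.com"  # placeholder hostname

# Pick the script matching the distribution and role on the node:
#   RKE2 (tarball install): /usr/local/bin/rke2-uninstall.sh
#   K3s server:             /usr/local/bin/k3s-uninstall.sh
#   K3s agent:              /usr/local/bin/k3s-agent-uninstall.sh
UNINSTALL_SCRIPT = "/usr/local/bin/rke2-uninstall.sh"

# Wipe the node's Kubernetes state so stale apiserver endpoints and data
# do not linger on reused infrastructure.
subprocess.run(["ssh", f"root@{NODE_HOST}", UNINSTALL_SCRIPT], check=True)
```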