[Known Issue] Custom RKE2/K3s cluster nodes may go NotReady when deleting controlplane nodes #41034

@Oats87

Description

Rancher Server Setup

  • Rancher version: v2.7.2
  • Installation option (Docker install/Helm Chart): N/A
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: N/A
  • Cluster Type (Local/Downstream): Downstream Custom K3s/RKE2
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) N/A
    • If custom, define the set of permissions: N/A

Describe the bug
If a custom cluster is running K3s or RKE2, deleting a controlplane + etcd node from the cluster can leave worker nodes NotReady indefinitely without intervention. This happens because the worker nodes' kubelets retain a stale connection to the kube-apiserver on the deleted node and cannot function through it. The workaround is to properly clean up (or delete) the node's underlying infrastructure after removing it from the cluster.
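
One way to confirm the stale-connection symptom on an affected worker (a minimal sketch, assuming the default apiserver port 6443 and the `ss` tool from iproute2; run it on the worker itself, with root privileges to resolve process names):

```python
import subprocess

# List the worker's established TCP connections to the kube-apiserver
# port (6443 by default for K3s/RKE2). If a peer address is the IP of
# the deleted controlplane node, the kubelet is holding the stale
# connection described above.
result = subprocess.run(
    ["ss", "-tnp", "state", "established", "( dport = :6443 )"],
    capture_output=True,
    text=True,
)
print(result.stdout)
```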

To Reproduce
  1. Provision a K3s or RKE2 cluster with multiple controlplane + etcd nodes and multiple workers.
  2. Delete a controlplane + etcd node from the cluster and observe that some of the workers may go NotReady and not self-recover (see the polling sketch below).
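
To watch for workers that stay NotReady, a sketch along these lines could be used (assuming the official `kubernetes` Python client and a kubeconfig pointed at the downstream cluster):

```python
import time

from kubernetes import client, config

config.load_kube_config()  # kubeconfig for the downstream custom cluster
v1 = client.CoreV1Api()

# Poll every node's Ready condition; workers that stay listed here long
# after the controlplane + etcd node was deleted reproduce the bug.
while True:
    for node in v1.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )
        if ready != "True":
            print(f"{node.metadata.name}: Ready={ready}")
    time.sleep(30)
```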

Result
Some of the workers may go NotReady and not self-recover.

Expected Result
The worker nodes may temporarily go NotReady but should eventually recover.

Additional context
This will be fixed by #41011 on the Rancher side and rancher/rke2#4060 on the RKE2 side.

The release note should look like this:

When using custom clusters in Rancher v2.7.2 with RKE2 and K3s, deleting a node without thoroughly cleaning up the underlying infrastructure can lead to unexpected behavior. When deleting a custom node from your cluster, ensure that you delete the underlying infrastructure for it or, alternatively, run the corresponding uninstall script for the Kubernetes distribution installed on the node. For RKE2, documentation can be found here: https://docs.rke2.io/install/uninstall?_highlight=uninstall#tarball-method and for K3s, documentation can be found here: https://docs.k3s.io/installation/uninstall
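
For example, a cleanup sketch along those lines (the host name is a placeholder; the script paths are the defaults from the linked RKE2 tarball and K3s uninstall docs):

```python
import subprocess

# Placeholder address of the node that was just removed from the cluster.
NODE = "deleted-node.example.com"

# Default uninstall script locations per the linked docs.
UNINSTALL = {
    "rke2": "/usr/local/bin/rke2-uninstall.sh",  # RKE2 tarball install
    "k3s": "/usr/local/bin/k3s-uninstall.sh",    # K3s install script
}

# After deleting the node from the cluster, run the matching uninstall
# script on the machine so no stale Kubernetes state is left behind.
subprocess.run(["ssh", f"root@{NODE}", UNINSTALL["rke2"]], check=True)
```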

Labels

  • area/provisioning-v2: Provisioning issues that are specific to the provisioningv2 generating framework
  • kind/bug: Issues that are defects reported by users or that we know have reached a real release
  • release-note: Note this issue in the milestone's release notes
