[Known Issue] Custom RKE2/K3s cluster nodes may go NotReady when deleting controlplane nodes #41034

@Oats87

Description

Rancher Server Setup

  • Rancher version: v2.7.2
  • Installation option (Docker install/Helm Chart): N/A
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: N/A
  • Cluster Type (Local/Downstream): Downstream Custom K3s/RKE2
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) N/A
    • If custom, define the set of permissions: N/A

Describe the bug
If a custom cluster is running K3s or RKE2, deleting a controlplane + etcd node from the cluster can leave worker nodes NotReady indefinitely without intervention. This happens because the worker nodes' kubelets retain a stale connection to the kube-apiserver on the deleted node and cannot function through it. The workaround is to properly clean up (or delete) the node's underlying infrastructure after removing it from the cluster.
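
One way to confirm the stale-connection symptom on an affected worker (a minimal sketch, assuming the default apiserver port 6443 and the `ss` tool from iproute2; run it on the worker itself, with root privileges to resolve process names):

```python
import subprocess

# List the worker's established TCP connections to the kube-apiserver
# port (6443 by default for K3s/RKE2). If a peer address is the IP of
# the deleted controlplane node, the kubelet is holding the stale
# connection described above.
result = subprocess.run(
    ["ss", "-tnp", "state", "established", "( dport = :6443 )"],
    capture_output=True,
    text=True,
)
print(result.stdout)
```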

To Reproduce
  1. Provision a K3s or RKE2 cluster with multiple controlplane + etcd nodes and multiple workers.
  2. Delete a controlplane + etcd node from the cluster and observe that some of the workers may go NotReady and not self-recover (see the polling sketch below).
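
To watch for workers that stay NotReady, a sketch along these lines could be used (assuming the official `kubernetes` Python client and a kubeconfig pointed at the downstream cluster):

```python
import time

from kubernetes import client, config

config.load_kube_config()  # kubeconfig for the downstream custom cluster
v1 = client.CoreV1Api()

# Poll every node's Ready condition; workers that stay listed here long
# after the controlplane + etcd node was deleted reproduce the bug.
while True:
    for node in v1.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )
        if ready != "True":
            print(f"{node.metadata.name}: Ready={ready}")
    time.sleep(30)
```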

Result
Some of the workers may go NotReady and not self-recover.

Expected Result
The worker nodes may temporarily go NotReady but should eventually recover.

Additional context
This will be fixed by #41011 on the Rancher side and rancher/rke2#4060 on the RKE2 side.

The release note should look like this:

When using custom clusters in Rancher v2.7.2 with RKE2 and K3s, deleting a node without thoroughly cleaning up the underlying infrastructure can lead to unexpected behavior. When deleting a custom node from your cluster, ensure that you delete the underlying infrastructure for it or, alternatively, run the corresponding uninstall script for the Kubernetes distribution installed on the node. For RKE2, documentation can be found here: https://docs.rke2.io/install/uninstall?_highlight=uninstall#tarball-method and for K3s, documentation can be found here: https://docs.k3s.io/installation/uninstall
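
For example, a cleanup sketch along those lines (the host name is a placeholder; the script paths are the defaults from the linked RKE2 tarball and K3s uninstall docs):

```python
import subprocess

# Placeholder address of the node that was just removed from the cluster.
NODE = "deleted-node.example.com"

# Default uninstall script locations per the linked docs.
UNINSTALL = {
    "rke2": "/usr/local/bin/rke2-uninstall.sh",  # RKE2 tarball install
    "k3s": "/usr/local/bin/k3s-uninstall.sh",    # K3s install script
}

# After deleting the node from the cluster, run the matching uninstall
# script on the machine so no stale Kubernetes state is left behind.
subprocess.run(["ssh", f"root@{NODE}", UNINSTALL["rke2"]], check=True)
```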

Labels

  • area/provisioning-v2: Provisioning issues that are specific to the provisioningv2 generating framework
  • kind/bug: Issues that are defects reported by users or that we know have reached a real release
  • release-note: Note this issue in the milestone's release notes
