
[BUG] rancher machines are not removed from the cluster after actual worker nodes removed #43686

@riuvshyn

Description

Rancher Server Setup

  • Rancher version: 2.7.9, 2.8.0
  • Installation option (Docker install/Helm Chart): helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details: none

Information about the Cluster

  • Kubernetes version: 1.27.8
  • Cluster Type (Local/Downstream): downstream custom RKE2 hosted on AWS
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions:

Describe the bug
After upgrading from 2.7.5 to 2.7.9, I've noticed that terminated worker nodes are still displayed in the Rancher UI with a Nodenotfound status.

The capi-controller-manager log is full of errors like:

E1204 17:41:44.435523       1 controller.go:329] "Reconciler error" err="no matching Node for Machine \"custom-fbdc7789f02e\" in namespace \"fleet-default\": cannot find node with matching ProviderID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/custom-fbdc7789f02e" namespace="fleet-default" name="custom-fbdc7789f02e" reconcileID=330124c5-1ffb-4f7d-a618-94a976c62106
E1204 17:41:44.585423       1 controller.go:329] "Reconciler error" err="no matching Node for Machine \"custom-be9d831e6358\" in namespace \"fleet-default\": cannot find node with matching ProviderID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="fleet-default/custom-be9d831e6358" namespace="fleet-default" name="custom-be9d831e6358" reconcileID=63186033-4f57-4524-85af-2123a88f3a04

The Machine resource mentioned in the error still exists:
k -n fleet-default get machine | grep custom-be9d831e6358

custom-be9d831e6358   rancher-euc1-te-test01   ip-172-29-196-238.eu-central-1.compute.internal   aws:///eu-central-1a/i-0a82b430f14175488   Running       5h6m
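The CAPI error above says there is no downstream node whose providerID matches the one recorded on this Machine. The recorded value can be read straight off the Machine object (a minimal check, using the machine name from the log above; the k alias points at the Rancher local/management cluster here):

k -n fleet-default get machine custom-be9d831e6358 -o jsonpath='{.spec.providerID}'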

However, the nodes.management.cattle.io resources for the old workers are gone.
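For reference, the Rancher-side node objects are namespaced under the cluster's management namespace, so a listing like the one below (a sketch; c-m-xxxxxxxx is a placeholder for this cluster's management namespace) no longer shows the terminated workers:

k -n c-m-xxxxxxxx get nodes.management.cattle.io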

Also, in the downstream RKE2 cluster, k get nodes does not show these old nodes, so this only affects Rancher.
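To confirm from the workload side, the downstream nodes can be listed together with their provider IDs (a sketch, run with a kubeconfig pointing at the downstream RKE2 cluster); none of them carries the aws:/// ID recorded on the stale Machine:

kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID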

To Reproduce

  • Terminate an actively running worker node (for example by terminating the backing EC2 instance; see the sketch below)
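On AWS this can be done by terminating the instance directly (a sketch; the instance ID here is the one visible in the Machine output above, substitute the instance backing the node being removed):

aws ec2 terminate-instances --instance-ids i-0a82b430f14175488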

Result
After the node is terminated it is removed from the cluster, but it is not removed from Rancher and the node is still displayed in the UI with a Nodenotfound status.

Expected Result
After the node is terminated it is removed from the cluster, and it is also removed from Rancher.

Screenshots
(screenshot: Rancher UI listing the terminated worker nodes with Nodenotfound status)

Additional context
In Slack, other people mentioned that they have had the same issue since version 2.7.6: https://rancher-users.slack.com/archives/C3ASABBD1/p1701711546706849

SURE-8277

Metadata

Labels

  • QA/M
  • area/capi: Provisioning issues that correspond with CAPI
  • area/capr: Provisioning issues that involve cluster-api-provider-rancher
  • kind/bug: Issues that are defects reported by users or that we know have reached a real release
  • priority/0
  • release-note: Note this issue in the milestone's release notes
  • status/release-note-added
  • team/hostbusters: The team that is responsible for provisioning/managing downstream clusters + K8s version support
