Description
Is your feature request related to a problem? Please describe.
In provisioning-v2, a custom node that is deleted from a cluster is not properly cleaned up. In the case of a controlplane node, this can lead to unexpected behavior, with etcd/worker nodes going into a perpetual `NotReady` state until the deleted node is either shut down or removed from the cluster. Additionally, in the case of non-etcd nodes, a custom node that was deleted and rebooted can rejoin the cluster if `rke2-server`, `rke2-agent`, `k3s`, or `k3s-agent` is started on it, as it will re-register into the cluster.
We need to provide the ability to clean up custom nodes from a cluster when they are deleted. We do not specifically have to worry about machine-provisioned infrastructure since, in most cases, that infrastructure is deleted when the machine is deleted.
Describe the solution you'd like
Add the ability to deliver an `uninstall` plan to a deleted custom node. This can be done through the `rancher-system-agent`, using a finalizer on the `machine` object that triggers the deletion logic. I am thinking we can implement a pre-delete hook with custom logic that performs the cleanup of the custom node. This functionality should also be able to be disabled.
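Roughly, the hook could look like the sketch below. This is only a sketch under assumed names: `Machine`, `PlanDeliverer`, the finalizer/annotation keys, and the uninstall script path are placeholders, not Rancher's actual API. A real implementation would also wait for the system-agent to acknowledge the plan before removing the finalizer.

```go
// Package machinecleanup: hypothetical sketch of a pre-delete hook for
// custom machines. All names here are assumptions for illustration.
package machinecleanup

import (
	"context"
	"fmt"
)

// Machine is a minimal stand-in for the machine object.
type Machine struct {
	Name        string
	Finalizers  []string
	Annotations map[string]string
	Deleting    bool
}

// PlanDeliverer abstracts delivery of a one-shot plan to rancher-system-agent
// (e.g. via the machine's plan secret).
type PlanDeliverer interface {
	Deliver(ctx context.Context, machineName string, instructions []string) error
}

const cleanupFinalizer = "rke.cattle.io/cleanup-custom-node" // assumed finalizer name

// OnMachineChange is a reconcile-style handler: when a custom machine is being
// deleted and cleanup is not disabled, deliver an uninstall plan, then drop the
// finalizer so deletion can complete.
func OnMachineChange(ctx context.Context, m *Machine, d PlanDeliverer) error {
	if !m.Deleting || !hasFinalizer(m, cleanupFinalizer) {
		return nil
	}
	if m.Annotations["rke.cattle.io/disable-cleanup"] == "true" { // opt-out switch
		removeFinalizer(m, cleanupFinalizer)
		return nil
	}
	// The uninstall script path differs between rke2 and k3s installs; a real
	// implementation would pick the right one and wait for the plan to apply.
	if err := d.Deliver(ctx, m.Name, []string{"/usr/local/bin/rke2-uninstall.sh"}); err != nil {
		return fmt.Errorf("delivering uninstall plan to %s: %w", m.Name, err)
	}
	removeFinalizer(m, cleanupFinalizer)
	return nil
}

func hasFinalizer(m *Machine, f string) bool {
	for _, x := range m.Finalizers {
		if x == f {
			return true
		}
	}
	return false
}

func removeFinalizer(m *Machine, f string) {
	out := m.Finalizers[:0]
	for _, x := range m.Finalizers {
		if x != f {
			out = append(out, x)
		}
	}
	m.Finalizers = out
}
```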
Note that there are caveats and edge cases to this, and it is important to try to minimize the potential for accidental data loss. For example, certain cloud providers may delete the `node` object if the node is offline, which could accidentally trigger machine deletion logic on our end via the `nodesyncer`.
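To illustrate that caveat, a `nodesyncer`-style handler should not treat a missing `node` object as a signal to clean up the backing machine. The hypothetical handler below continues the sketch above (same assumed types and annotation key); only an explicit machine deletion reaches the uninstall path.

```go
// SyncMissingNode sketches what a nodesyncer-style handler might do when the
// corev1 Node backing a machine disappears: record the condition, but leave
// the machine (and its finalizer) alone so no uninstall plan is generated.
func SyncMissingNode(ctx context.Context, m *Machine) error {
	if m.Deleting {
		return nil // the explicit deletion path above already handles cleanup
	}
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	// Do not delete the machine here: the host may only be offline and the
	// cloud provider may have garbage-collected its Node object.
	m.Annotations["rke.cattle.io/node-missing"] = "true" // assumed marker annotation
	return nil
}
```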
Describe alternatives you've considered
Release note and/or workaround documentation
Additional context
The specific issue that I encountered that spawned this RFE is here: rancher/rke2#4060