Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
In AWS EKS and in CAPI (CAPC) clusters (and possibly others not tested), when the VM backing a worker node fails and that node is hosting the cilium-operator pod, node deletion is delayed significantly (15+ minutes) by spurious attempts of the cilium-operator deployment to keep a cilium-operator pod running on the failed node while repeated drain attempts are made. This results in (typically) 15 Terminating instances of the cilium-operator pod, all of which must reach their deletion grace period (typically 5m) before they are forcibly terminated, one by one.
While this is merely inefficient in EKS (where the autoscaler has already replaced the failed node), in CAPI clusters node replacement by MachineHealthCheck does not occur until the failed node has been successfully deleted, so scale and availability are impacted for this extended period. In both environments the cilium-operator's controller is not active until this extended draining period completes.
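For reference, a minimal monitoring sketch that could be used to watch this behavior as it unfolds, assuming a default Cilium install where the operator pods carry the label name=cilium-operator (the selector is an assumption on my part and may need adjusting):

```shell
# Hedged sketch: watch node status and the cilium-operator pods together while
# the failed node is being drained. The name=cilium-operator label selector is
# assumed from a default Cilium install and may differ in other deployments.
watch -n 10 '
  kubectl get nodes
  kubectl -n kube-system get pods -l name=cilium-operator -o wide
'
```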
Demonstrating with EKS:
- Provision an EKS cluster as described in https://docs.cilium.io/en/stable/gettingstarted/k8s-install-default/. I configured 3 nodes to match my CAPI test configuration.
- Install and verify Cilium, as described in the same document.
- Confirm cluster health with kubectl get nodes
- Identify the node hosting the cilium-operator pod with kubectl -n kube-system get pods -o wide
- In the AWS console, terminate the EC2 instance implementing that node (match the node name to the instance's DNS name). In my test run I terminated node ip-192-168-102-184.us-east-2.compute.internal.
- Inspect the cluster nodes, observing that the terminated node becomes unschedulable.
- Begin repeatedly monitoring the pods with kubectl -n kube-system get pods -o wide. Observe that additional cilium-operator pods begin accumulating, all but one in Terminating state and one Pending, all reported as placed on the failed node.
- After about 15 of these accumulate, observe that they begin being force-deleted (a sketch for counting them is included below).
- When all are deleted, the failed node is finally drained and deleted.
Attached is the output of kubectl -n kube-system get pods -o wide illustrating the many cilium-operator pods in Terminating state (plus one Pending) that ultimately accumulated in my test run, all placed on ip-192-168-102-184.us-east-2.compute.internal, the node whose VM was stopped.
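For anyone reproducing this, a rough sketch of how the accumulation could be counted; the node name is from this test run and the label selector is again an assumption about a default install:

```shell
# Hedged sketch: count the Terminating cilium-operator pods pinned to the
# failed node. The node name is from this test run; the label selector is an
# assumption about a default Cilium install.
kubectl -n kube-system get pods -l name=cilium-operator -o wide --no-headers \
  | grep ip-192-168-102-184.us-east-2.compute.internal \
  | grep -c Terminating
```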
Cilium Version
cilium version
cilium-cli: v0.9.3 compiled with go1.17.3 on darwin/amd64
cilium image (default): v1.10.5
cilium image (stable): v1.11.2
cilium image (running): v1.10.5
Kernel Version
I do not have SSH access to the EKS instances that hosted the workers, so I was unable to determine this.
Kubernetes Version
v1.21.5-eks-9017834
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct