Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
In AWS EKS and in CAPI (CAPC) clusters (and possibly others not tested), when the VM backing a worker node fails and that node is hosting the cilium-operator pod, node deletion is delayed significantly (15+ minutes) by spurious attempts of the cilium-operator deployment to keep a cilium-operator pod running on the failed node while repeated drain attempts are made. This results in (typically) 15 Terminating instances of the cilium-operator pod, all of which must reach their deletion grace period (typically 5m) before they are forcibly terminated, one by one.
While this is merely inefficient in EKS (where the autoscaler has already replaced the failed node), in CAPI clusters node replacement by MachineHealthCheck does not occur until the failed node has been successfully deleted, so scale and availability are impacted for this extended period. In both environments the cilium-operator's controller is not active until this extended draining period completes.
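For reference, a minimal monitoring sketch that could be used to watch this behavior as it unfolds, assuming a default Cilium install where the operator pods carry the label name=cilium-operator (the selector is an assumption on my part and may need adjusting):

```shell
# Hedged sketch: watch node status and the cilium-operator pods together while
# the failed node is being drained. The name=cilium-operator label selector is
# assumed from a default Cilium install and may differ in other deployments.
watch -n 10 '
  kubectl get nodes
  kubectl -n kube-system get pods -l name=cilium-operator -o wide
'
```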
Demonstrating with EKS:
- Provision an EKS cluster as described in https://docs.cilium.io/en/stable/gettingstarted/k8s-install-default/. I configured 3 nodes to match my CAPI test configuration.
- Install and verify Cilium, as described in the same document.
- Confirm cluster health with kubectl get nodes
- Identify the node hosting the cilium-operator pod with kubectl -n kube-system get pods -o wide
- In the AWS console, terminate the EC2 instance implementing that node (match the node name to the instance's DNS name). In my test run I terminated node ip-192-168-102-184.us-east-2.compute.internal.
- Inspect the cluster nodes, observing that the terminated node becomes unschedulable.
- Begin repeatedly monitoring the pods with kubectl -n kube-system get pods -o wide. Observe that additional cilium-operator pods begin accumulating, all but one in Terminating state and one Pending, all reported as placed on the failed node.
- After about 15 of these accumulate, observe that they begin being force-deleted (a sketch for counting them is included below).
- When all are deleted, the failed node is finally drained and deleted.
Attached is the output of kubectl -n kube-system get pods -o wide illustrating the many cilium-operator pods in Terminating state (plus one Pending) that ultimately accumulated in my test run, all placed on ip-192-168-102-184.us-east-2.compute.internal, the node whose VM was stopped.
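For anyone reproducing this, a rough sketch of how the accumulation could be counted; the node name is from this test run and the label selector is again an assumption about a default install:

```shell
# Hedged sketch: count the Terminating cilium-operator pods pinned to the
# failed node. The node name is from this test run; the label selector is an
# assumption about a default Cilium install.
kubectl -n kube-system get pods -l name=cilium-operator -o wide --no-headers \
  | grep ip-192-168-102-184.us-east-2.compute.internal \
  | grep -c Terminating
```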
Cilium Version
cilium version
cilium-cli: v0.9.3 compiled with go1.17.3 on darwin/amd64
cilium image (default): v1.10.5
cilium image (stable): v1.11.2
cilium image (running): v1.10.5
Kernel Version
I do not have SSH access to the EKS instances that hosted the workers, so I was unable to determine this.
Kubernetes Version
v1.21.5-eks-9017834
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct