
tolerations: [{operator: Exists}] on operator prevents node drain #28549

@sterlingbates

Description


Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Problem

A cilium-operator pod drained from a node is automatically rescheduled back onto that drained node. This may very well be a Kubernetes oversight, but Cilium's helm chart can be easily patched against it.

The helm chart's default toleration clause is interpreted by Kubernetes to mean a global match for any key or value:

Helm chart

  tolerations:
  - operator: Exists

Kubernetes code

// An empty key with Exists operator means match all keys & values.
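
For context on why this interacts with draining: cordoning a node sets spec.unschedulable: true, and the node lifecycle controller then taints it with node.kubernetes.io/unschedulable:NoSchedule. The rendered toleration above matches that taint, because an empty key with Exists tolerates every taint. A minimal sketch of the two objects involved (field layout follows upstream Kubernetes; not copied from the chart):

  # Taint placed on a cordoned/drained node by the node lifecycle controller
  taints:
  - key: node.kubernetes.io/unschedulable
    effect: NoSchedule

  # Toleration rendered for cilium-operator by the default chart values;
  # no key plus operator: Exists matches the taint above (and any other taint)
  tolerations:
  - operator: Exists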

Possible Fix
Default to an empty tolerations list, which plenty of other areas in the helm chart already use:

  tolerations: []
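
Until the default changes, a user-side workaround is to override the operator tolerations with something narrower at install time. A hedged sketch, assuming the chart exposes them under operator.tolerations (verify against your chart version's values.yaml):

  # values override: tolerate only control-plane taints instead of everything
  operator:
    tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule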

Similar

#18995 is very close, but was resolved with an HA operator config. I have two operators running, but still observe the problem.

Steps to Reproduce

  1. Drain a node with cilium-operator that has only tolerations: [{operator: Exists}] in the spec.
  2. Observe that the node now carries the NoSchedule taint and is marked unschedulable: true.
  3. Observe the operator pod deleted from the node.
  4. Observe a new operator pod immediately scheduled to the same drained node.
  5. The pod is never scheduled to another node, no matter how long you wait.

Expected

The cilium-operator pod is scheduled to another node.

Actual

The pod remains hauntingly, aggravatingly, and most deeply annoyingly fixated on the drained node. The scheduler log shows this:

I1011 18:49:14.299093       1 schedule_one.go:265] "Successfully bound pod to node" pod="kube-system/cilium-operator-59d78d96f4-khndr" node="lab1-qz2-sr1-rk18-s24-mstr001" evaluatedNodes=3 feasibleNodes=3

The feasibleNodes=3 entry means that the scheduler's taint/toleration and affinity filters did not rule out the unschedulable node as a candidate for the operator pod.

Demo

Notes:

  1. The primary2 node is already set to SchedulingDisabled.
  2. Ignoring the fact that the pod shouldn't be there now, it certainly shouldn't be after another drain.
  3. Drain the node.
  4. Re-list the pods, and note that the operator scheduled right back onto the unschedulable node.
root@primary1:~# kubectl get nodes -o wide
NAME                            STATUS                     ROLES           AGE     VERSION    INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
primary1   Ready                      control-plane   7h23m   v1.24.9    192.168.50.195   <none>        Ubuntu 18.04.5 LTS   4.15.0-1080-ibm-gt   containerd://1.6.8
primary2   Ready,SchedulingDisabled   <none>          7h19m   v1.23.11   192.168.50.197   <none>        Ubuntu 18.04.5 LTS   4.15.0-1080-ibm-gt   containerd://1.6.8
primary3   Ready                      <none>          7h19m   v1.24.9    192.168.50.199   <none>        Ubuntu 18.04.5 LTS   4.15.0-1080-ibm-gt   containerd://1.6.8

root@primary1:~# kubectl get pods -o wide -A
NAMESPACE       NAME                                 READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
kube-system     cilium-operator-59d78d96f4-blgx4     1/1     Running   0          5h48m   192.168.50.199   primary3   <none>           <none>
kube-system     cilium-operator-59d78d96f4-khndr     1/1     Running   0          3h52m   192.168.50.197   primary2   <none>           <none>

root@primary1:~# kubectl drain primary2 --ignore-daemonsets
node/primary2 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/cilium-phlcx, kube-system/kube-proxy-rqfn2
evicting pod kube-system/cilium-operator-59d78d96f4-khndr
pod/cilium-operator-59d78d96f4-khndr evicted
node/primary2 drained

root@primary1:~# kubectl get pods -o wide -A
NAMESPACE       NAME                                 READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
kube-system     cilium-operator-59d78d96f4-blgx4     1/1     Running   0          5h52m   192.168.50.199   primary3   <none>           <none>
kube-system     cilium-operator-59d78d96f4-sclwl     1/1     Running   0          98s     192.168.50.197   primary2   <none>           <none>

Cilium Version

I'm running an old 1.11.1, but the Helm chart problem is present in all releases.

Kernel Version

n/a

Kubernetes Version

1.24.9, but, as with Cilium, pretty much every version released in the past four years is affected.

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

    Labels

  • area/agent: Cilium agent related.
  • area/helm: Impacts helm charts and user deployment experience.
  • area/operator: Impacts the cilium-operator component.
  • good-first-issue: Good starting point for new developers, which requires minimal understanding of Cilium.
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
