Is there an existing issue for this?
- I have searched the existing issues
What happened?
Problem
A `cilium-operator` pod drained from a node is automatically rescheduled back onto that drained node. This may very well be a Kubernetes oversight, but Cilium's Helm chart can easily be patched against it.

The Helm chart's default toleration clause is interpreted by Kubernetes as a global match for any key or value:

```yaml
tolerations:
  - operator: Exists
# An empty key with the Exists operator means "match all keys & values".
```
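For context, cordoning or draining a node sets `spec.unschedulable: true` and Kubernetes adds the `node.kubernetes.io/unschedulable:NoSchedule` taint to it; a blanket `operator: Exists` toleration matches that taint, so the scheduler keeps treating the node as feasible. A minimal sketch of the relevant part of a cordoned node's spec (illustrative excerpt, not taken from this cluster):

```yaml
# Cordoned node (excerpt). The chart's blanket toleration matches this taint.
spec:
  unschedulable: true
  taints:
    - key: node.kubernetes.io/unschedulable
      effect: NoSchedule
```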
Possible Fix
Default to an empty toleration list instead; plenty of other areas in the Helm chart already use this:

```yaml
tolerations: []
```
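Until the default changes, one workaround is to narrow the operator's tolerations at install time. A minimal sketch, assuming the chart exposes them under `operator.tolerations` (check the values.yaml of your chart version); it tolerates only the control-plane taint instead of everything:

```yaml
# values-override.yaml (hypothetical file name)
operator:
  tolerations:
    # Do NOT tolerate node.kubernetes.io/unschedulable; only the control-plane taint.
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
```

Applied with something like `helm upgrade cilium cilium/cilium -n kube-system --reuse-values -f values-override.yaml`.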
Similar
#18995 is very close, but it was resolved with an HA operator configuration. I have two operators running and still observe the problem.
Steps to Reproduce
- Drain a node running a `cilium-operator` pod that has only `tolerations: [{operator: Exists}]` in its spec.
- See that the node gets the `NoSchedule` taint and `unschedulable: true` (can be confirmed as shown after this list).
- Observe the operator pod deleted from the node.
- Observe a new operator pod immediately scheduled onto the same drained node.
- The pod is never scheduled to another node, no matter how long you wait.
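The taint and the unschedulable flag can be confirmed like this (node name is illustrative):

```
# Show the taints and the unschedulable flag that kubectl drain/cordon set on the node.
kubectl get node primary2 -o jsonpath='{.spec.taints}{"\n"}{.spec.unschedulable}{"\n"}'
# Output resembles:
# [{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable"}]
# true
```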
Expected
The `cilium-operator` pod is scheduled to another node.
Actual
The pod remains hauntingly, aggravatingly, and most deeply annoyingly fixated on the drained node. The scheduler log shows this:

```
I1011 18:49:14.299093 1 schedule_one.go:265] "Successfully bound pod to node" pod="kube-system/cilium-operator-59d78d96f4-khndr" node="lab1-qz2-sr1-rk18-s24-mstr001" evaluatedNodes=3 feasibleNodes=3
```

The `feasibleNodes=3` value means that Kubernetes' affinity/taint filters failed to rule out the unschedulable node as a candidate for the operator pod.
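One way to see why the node remains feasible is to dump the operator pod's tolerations; with the chart default, the blanket entry is present alongside any NoExecute tolerations Kubernetes injects automatically. A sketch (pod name is illustrative):

```
kubectl -n kube-system get pod cilium-operator-59d78d96f4-khndr -o jsonpath='{.spec.tolerations}{"\n"}'
# Expect the blanket toleration, roughly:
# [{"operator":"Exists"}, ...]
```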
Demo
Notes:
- The `primary2` node is already set to `SchedulingDisabled`. Ignoring the fact that the pod shouldn't be on there now, it certainly shouldn't be after another drain.
- Drain the node.
- Re-list the pods, and note that the operator scheduled right back onto the unschedulable node.
```
root@primary1:~# kubectl get nodes -o wide
NAME       STATUS                     ROLES           AGE     VERSION    INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
primary1   Ready                      control-plane   7h23m   v1.24.9    192.168.50.195   <none>        Ubuntu 18.04.5 LTS   4.15.0-1080-ibm-gt   containerd://1.6.8
primary2   Ready,SchedulingDisabled   <none>          7h19m   v1.23.11   192.168.50.197   <none>        Ubuntu 18.04.5 LTS   4.15.0-1080-ibm-gt   containerd://1.6.8
primary3   Ready                      <none>          7h19m   v1.24.9    192.168.50.199   <none>        Ubuntu 18.04.5 LTS   4.15.0-1080-ibm-gt   containerd://1.6.8
root@primary1:~# kubectl get pods -o wide -A
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
kube-system   cilium-operator-59d78d96f4-blgx4   1/1     Running   0          5h48m   192.168.50.199   primary3   <none>           <none>
kube-system   cilium-operator-59d78d96f4-khndr   1/1     Running   0          3h52m   192.168.50.197   primary2   <none>           <none>
root@primary1:~# kubectl drain primary2 --ignore-daemonsets
node/primary2 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/cilium-phlcx, kube-system/kube-proxy-rqfn2
evicting pod kube-system/cilium-operator-59d78d96f4-khndr
pod/cilium-operator-59d78d96f4-khndr evicted
node/primary2 drained
root@primary1:~# kubectl get pods -o wide -A
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
kube-system   cilium-operator-59d78d96f4-blgx4   1/1     Running   0          5h52m   192.168.50.199   primary3   <none>           <none>
kube-system   cilium-operator-59d78d96f4-sclwl   1/1     Running   0          98s     192.168.50.197   primary2   <none>           <none>
```
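As a stop-gap on a running cluster, the deployed operator's tolerations can also be narrowed in place, for example with a JSON patch (a sketch; the replacement toleration shown is an assumption and should be adapted to where the operator is meant to run):

```
kubectl -n kube-system patch deployment cilium-operator --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/tolerations", "value": [
    {"key": "node-role.kubernetes.io/control-plane", "operator": "Exists", "effect": "NoSchedule"}
  ]}
]'
```

Keep in mind that a later `helm upgrade` with the default values would revert this.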
Cilium Version
I'm running an old 1.11.1, but the Helm chart problem is present in all releases.
Kernel Version
n/a
Kubernetes Version
1.24.9, but, like Cilium, pretty much all versions from the past four years are affected.
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project's Code of Conduct