Proposal / RFE
See #15366 for one recent report of this.
Problem: when a nodepool scale-out event happens (often triggered by a number of unscheduled Pods waiting for resources), workloads are often scheduled onto new nodes before the Cilium agent image has had a chance to be pulled and started. The result is Cilium-unmanaged endpoints. The agent image is several hundred MB in size, so this occurs frequently enough to be noticeable, especially for some of our users who run at scale.
We've been discussing solutions to this internally for a while, but one of the working proposals is as follows:
Kubelet has the `--register-with-taints` parameter to taint its own node resource atomically when it first registers itself with the Kubernetes API. There is no window of time in which the node can exist without these taints, so it's guaranteed to block scheduling (unless a Pod tolerates all taints, which we can't/won't account for). Notably, this also blocks DaemonSets, since those tolerate only a specific list of taints. Once the Cilium agent has started up successfully, it removes the specified taint, at which point workload scheduling can resume.
Practically, this means:
- During cluster or nodepool creation, the user specifies a Cilium-specific on-register taint, let's say `node.cilium.io/agent-not-ready`. This will ensure the taint is automatically applied to all newly-created nodes.
  - Azure AKS: `az aks nodepool add .. --node-taints node.cilium.io/agent-not-ready=true:NoSchedule` (docs)
    Note: AKS only allows specifying taints on specific node pools, not on the cluster as a whole. This needs to be taken into account during cluster planning. Since autoscaling nodepools are commonly run separately, this would not be a blocker, and existing clusters can be provisioned with new nodepools to take advantage of this feature when released.
  - Google GKE: `gcloud container clusters create cluster-name --node-taints node.cilium.io/agent-not-ready=true:NoSchedule` (docs) - can be set on either a cluster or nodepool level!
  - Amazon EKS: depends on the provisioning method, but `bootstrap.sh` supports arbitrary kubelet arguments through `--kubelet-extra-args` (docs), and `eksctl` supports setting taints on `nodeGroups` (docs).
- The Cilium agent DaemonSet would tolerate this taint, so it would be allowed to schedule first. Once the agent has started successfully, it removes the taint from its own node resource (see the client-go sketch after this list).
- All other Pods will now get the opportunity to schedule.
- (see below) On shutdown, reapply `node.cilium.io/agent-not-ready=true:NoSchedule` to the local node.
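
To make the agent-side step concrete, here is a minimal client-go sketch of removing the taint from the local node once the agent is ready. The function name `removeAgentNotReadyTaint` and the surrounding wiring are hypothetical and only illustrate the proposal, not actual Cilium code.

```go
package agent

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Key of the proposed startup taint.
const agentNotReadyTaintKey = "node.cilium.io/agent-not-ready"

// removeAgentNotReadyTaint removes the startup taint from the node the agent
// is running on, once the agent is ready to manage endpoints.
func removeAgentNotReadyTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Keep every taint except the Cilium startup taint.
	kept := make([]corev1.Taint, 0, len(node.Spec.Taints))
	found := false
	for _, t := range node.Spec.Taints {
		if t.Key == agentNotReadyTaintKey {
			found = true
			continue
		}
		kept = append(kept, t)
	}
	if !found {
		return nil // taint already absent, nothing to do
	}

	node.Spec.Taints = kept
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

A real implementation would want to retry on update conflicts (e.g. with `retry.RetryOnConflict`) or use a patch, and the agent's ServiceAccount would need RBAC permission to update Nodes.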
@aanm This would be another angle to the problem `*.cilium_bak` tries to solve.
Caveat: Taints are not reapplied when Kubelet restarts, when the node reboots, or when the Cilium agent is restarted. In the case of a full node reboot or a Cilium agent restart, this still leaves a window for Pods to be scheduled without receiving Cilium identities. To combat this, the Cilium agent could taint its own node during shutdown, which would prevent workloads from being scheduled while the Cilium agent is down (including during early kubelet startup), as sketched below.
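
For the shutdown path, a sketch could look like the following, reusing the imports and `agentNotReadyTaintKey` constant from the sketch above; the helper name is again hypothetical.

```go
// reapplyAgentNotReadyTaint re-taints the local node while the agent shuts
// down, so new workloads are not scheduled until the agent comes back up.
func reapplyAgentNotReadyTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Nothing to do if the taint is already present.
	for _, t := range node.Spec.Taints {
		if t.Key == agentNotReadyTaintKey {
			return nil
		}
	}

	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    agentNotReadyTaintKey,
		Value:  "true",
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

Note that this only covers a graceful shutdown; a crash or hard node reset would still leave the window described above.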