Preventing unmanaged Cilium endpoints on newly-created nodes #16602

@ti-mo

Proposal / RFE

See #15366 for one recent report of this.

Problem: when a nodepool scale-out event happens (often triggered by unscheduled Pods waiting for resources), workloads are frequently scheduled onto new nodes before the Cilium agent has had a chance to be pulled and started. The result is Cilium-unmanaged endpoints. The agent image is several hundred MB in size, so this occurs often enough to be noticeable, especially for some of our users that run at scale.

We've been discussing solutions to this internally for a while, but one of the working proposals is as follows:

Kubelet has the --register-with-taints parameter to taint its own Node resource atomically when it first registers itself with the Kubernetes API. There is no window of time during which the node can exist without these taints, so scheduling is guaranteed to be blocked (unless a Pod tolerates all taints, which we can't/won't account for). Notably, this also blocks DaemonSets, since those tolerate only a specific list of taints. Once the Cilium agent has started up successfully, it removes the specified taint, at which point workload scheduling can resume.
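For concreteness, a minimal sketch of the proposed taint and of the matching toleration the agent DaemonSet would need, expressed with the upstream corev1 Go types. The key/value/effect are the ones proposed below, and the variable and package names are illustrative only, not existing Cilium code:

```go
package agent // illustrative package name, not existing Cilium code

import corev1 "k8s.io/api/core/v1"

// Proposed startup taint that kubelet would register the node with.
var agentNotReadyTaint = corev1.Taint{
	Key:    "node.cilium.io/agent-not-ready",
	Value:  "true",
	Effect: corev1.TaintEffectNoSchedule,
}

// Matching toleration the Cilium agent DaemonSet would carry so that it can
// still be scheduled onto freshly registered (tainted) nodes.
var agentNotReadyToleration = corev1.Toleration{
	Key:      "node.cilium.io/agent-not-ready",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoSchedule,
}
```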

Practically, this means:

  • During cluster or nodepool creation, the user specifies a Cilium-specific on-register taint, let's say node.cilium.io/agent-not-ready. This will ensure the taint is automatically applied to all newly-created nodes.
    • Azure AKS: az aks nodepool add .. --node-taints node.cilium.io/agent-not-ready=true:NoSchedule (docs)
      Note: AKS only allows specifying taints on specific node pools, not on the cluster as a whole. This needs to be taken into account during cluster planning. Since autoscaling nodepools are commonly run separately, this would not be a blocker, and existing clusters can be provisioned with new nodepools to take advantage of this feature when released.
    • Google GKE: gcloud container clusters create cluster-name --node-taints node.cilium.io/agent-not-ready=true:NoSchedule (docs) - can be set on either a cluster or nodepool level!
    • Amazon EKS: depends on the provisioning method, but bootstrap.sh has support for arbitrary kubelet arguments through --kubelet-extra-args (docs) and eksctl supports setting taints on nodeGroups (docs).
  • The Cilium agent DaemonSet would tolerate this taint, so it would be allowed to schedule first. Once the agent has started successfully, it removes the taint from its own Node resource (see the sketch after this list).
  • All other Pods will now get the opportunity to schedule.
  • (see below) On shutdown, reapply node.cilium.io/agent-not-ready=true:NoSchedule to the local node.
    @aanm This would be another angle on the problem that *.cilium_bak tries to solve.
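To illustrate the taint-removal step from the DaemonSet bullet above, here is a rough client-go sketch of what the agent could do once it has started successfully. The function and package names are made up for illustration; this is not existing Cilium code, and a real implementation would likely want a patch and/or retry-on-conflict loop instead of a plain Update:

```go
package agent // illustrative package name

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const agentNotReadyTaintKey = "node.cilium.io/agent-not-ready"

// removeAgentNotReadyTaint strips the startup taint from the agent's own Node
// resource, allowing regular workloads to be scheduled on the node again.
func removeAgentNotReadyTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Filter out the startup taint in place.
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Key != agentNotReadyTaintKey {
			kept = append(kept, t)
		}
	}
	if len(kept) == len(node.Spec.Taints) {
		return nil // taint was not present, nothing to do
	}
	node.Spec.Taints = kept
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```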

Caveat: Taints are not reapplied when Kubelet restarts, when the node reboots, or when the Cilium agent is restarted. In the case of a full node reboot or a Cilium agent restart, this still leaves a window for Pods to be scheduled without receiving Cilium identities. To combat this, the Cilium agent could taint its own node during shutdown, which would prevent workloads from being scheduled while the Cilium agent is down (including during early kubelet startup).
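A corresponding sketch of that shutdown path, under the same assumptions as the previous snippet (hypothetical names, plain Update instead of a conflict-safe patch): re-apply the taint before the agent exits so nothing can be scheduled while it is down.

```go
package agent // illustrative package name

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reapplyAgentNotReadyTaint puts the startup taint back on the node, e.g. from
// the agent's shutdown/signal handler, so that new Pods cannot be scheduled
// while the agent is not running.
func reapplyAgentNotReadyTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == "node.cilium.io/agent-not-ready" {
			return nil // taint already present
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "node.cilium.io/agent-not-ready",
		Value:  "true",
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```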

cc @bmcustodio @joestringer
