Proposal / RFE
See #15366 for one recent report of this.
Problem: when a nodepool scale-out event happens (often triggered by a number of unscheduled Pods waiting for resources), workloads are often scheduled onto new nodes before the Cilium agent image has had a chance to be pulled and started. The result is Cilium-unmanaged endpoints. The agent image is several hundred MB in size, so this occurs frequently enough to be noticeable, especially for some of our users who run at scale.
We've been discussing solutions to this internally for a while, but one of the working proposals is as follows:
Kubelet has the `--register-with-taints` parameter to taint its own node resource atomically when it first registers itself with the Kubernetes API. There is no window of time in which the node can exist without these taints, so it's guaranteed to block scheduling (unless a Pod tolerates all taints, which we can't/won't account for). Notably, this also blocks DaemonSets, since those tolerate only a specific list of taints. Once the Cilium agent has started up successfully, it removes the specified taint, at which point workload scheduling can resume.
Practically, this means:
- During cluster or nodepool creation, the user specifies a Cilium-specific on-register taint, let's say `node.cilium.io/agent-not-ready`. This will ensure the taint is automatically applied to all newly-created nodes.
  - Azure AKS: `az aks nodepool add .. --node-taints node.cilium.io/agent-not-ready=true:NoSchedule` (docs)
    Note: AKS only allows specifying taints on specific node pools, not on the cluster as a whole. This needs to be taken into account during cluster planning. Since autoscaling nodepools are commonly run separately, this would not be a blocker, and existing clusters can be provisioned with new nodepools to take advantage of this feature when released.
  - Google GKE: `gcloud container clusters create cluster-name --node-taints node.cilium.io/agent-not-ready=true:NoSchedule` (docs) - can be set on either a cluster or nodepool level!
  - Amazon EKS: depends on the provisioning method, but `bootstrap.sh` supports arbitrary kubelet arguments through `--kubelet-extra-args` (docs), and `eksctl` supports setting taints on `nodeGroups` (docs).
- The Cilium agent DaemonSet would tolerate this taint, so it would be allowed to schedule first. Once the agent has started successfully, it removes the taint from its own node resource (see the client-go sketch after this list).
- All other Pods will now get the opportunity to schedule.
- (see below) On shutdown, reapply `node.cilium.io/agent-not-ready=true:NoSchedule` to the local node.
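
To make the agent-side step concrete, here is a minimal client-go sketch of removing the taint from the local node once the agent is ready. The function name `removeAgentNotReadyTaint` and the surrounding wiring are hypothetical and only illustrate the proposal, not actual Cilium code.

```go
package agent

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Key of the proposed startup taint.
const agentNotReadyTaintKey = "node.cilium.io/agent-not-ready"

// removeAgentNotReadyTaint removes the startup taint from the node the agent
// is running on, once the agent is ready to manage endpoints.
func removeAgentNotReadyTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Keep every taint except the Cilium startup taint.
	kept := make([]corev1.Taint, 0, len(node.Spec.Taints))
	found := false
	for _, t := range node.Spec.Taints {
		if t.Key == agentNotReadyTaintKey {
			found = true
			continue
		}
		kept = append(kept, t)
	}
	if !found {
		return nil // taint already absent, nothing to do
	}

	node.Spec.Taints = kept
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

A real implementation would want to retry on update conflicts (e.g. with `retry.RetryOnConflict`) or use a patch, and the agent's ServiceAccount would need RBAC permission to update Nodes.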
@aanm This would be another angle to the problem `*.cilium_bak` tries to solve.
Caveat: Taints are not reapplied when Kubelet restarts, when the node reboots, or when the Cilium agent is restarted. In the case of a full node reboot or a Cilium agent restart, this still leaves a window for Pods to be scheduled without receiving Cilium identities. To combat this, the Cilium agent could taint its own node during shutdown, which would prevent workloads from being scheduled while the Cilium agent is down (including during early kubelet startup), as sketched below.
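
For the shutdown path, a sketch could look like the following, reusing the imports and `agentNotReadyTaintKey` constant from the sketch above; the helper name is again hypothetical.

```go
// reapplyAgentNotReadyTaint re-taints the local node while the agent shuts
// down, so new workloads are not scheduled until the agent comes back up.
func reapplyAgentNotReadyTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Nothing to do if the taint is already present.
	for _, t := range node.Spec.Taints {
		if t.Key == agentNotReadyTaintKey {
			return nil
		}
	}

	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    agentNotReadyTaintKey,
		Value:  "true",
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

Note that this only covers a graceful shutdown; a crash or hard node reset would still leave the window described above.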