Change the default taints that Cilium tolerates to avoid deploying to a drained node #40475

Merged
merged 1 commit into from
Jul 10, 2025

@parlakisik commented Jul 10, 2025

[ upstream commit 589164b ]
The cilium-operator pod can be automatically rescheduled onto a drained node, which can block some Kubernetes upgrades.

The default tolerations for cilium-operator are now set to the values below.

    - key: "node-role.kubernetes.io/control-plane"
      operator: Exists
    - key: "node-role.kubernetes.io/master" #deprecated
      operator: Exists
    - key: "node.kubernetes.io/not-ready"
      operator: Exists
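
To illustrate why an explicit list behaves differently from a catch-all toleration, here is a minimal Python sketch of the taint/toleration matching rules as the Kubernetes documentation describes them. It is not Cilium or Kubernetes code: the class and function names are hypothetical, the "blanket" entry assumes the earlier default was a keyless `operator: Exists` toleration (a later commit message in this thread says v1.17 tolerated all taints), and the cordoned-node taint `node.kubernetes.io/unschedulable:NoSchedule` is an assumption about what a drained node carries.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str      # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with Exists matches every taint key
    operator: str = "Equal"  # Kubernetes treats an omitted operator as Equal
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    return False


def schedulable(tolerations: List[Toleration], taints: Iterable[Taint]) -> bool:
    """A pod fits a node only if every hard taint is matched by some toleration."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )


# The three tolerations listed in this PR description.
NEW_DEFAULTS = [
    Toleration(key="node-role.kubernetes.io/control-plane", operator="Exists"),
    Toleration(key="node-role.kubernetes.io/master", operator="Exists"),
    Toleration(key="node.kubernetes.io/not-ready", operator="Exists"),
]
# Assumed earlier behaviour: one toleration that matches any taint.
BLANKET = [Toleration(operator="Exists")]
# Assumed taint on a cordoned (drained) node.
CORDONED = [Taint(key="node.kubernetes.io/unschedulable", effect="NoSchedule")]

if __name__ == "__main__":
    print("blanket toleration fits drained node:", schedulable(BLANKET, CORDONED))
    print("explicit list fits drained node:", schedulable(NEW_DEFAULTS, CORDONED))
```

Under these assumptions, the catch-all toleration lets the pod land on the cordoned node, while the explicit list does not, because none of its keys match the unschedulable taint.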

These are the test results:

// Testing template output 
#helm template cilium ./cilium --debug
. . . . . 
      hostNetwork: true
      restartPolicy: Always
      priorityClassName: system-cluster-critical
      serviceAccountName: "cilium-operator"
      automountServiceAccountToken: true
      # In HA mode, cilium-operator pods must not be scheduled on the same
      # node as they will clash with each other.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                io.cilium/app: operator
            topologyKey: kubernetes.io/hostname
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
        - key: "node-role.kubernetes.io/control-plane"
          operator: Exists
        - key: "node-role.kubernetes.io/master" #deprecated
          operator: Exists
        - key: "node.kubernetes.io/not-ready"
          operator: Exists
        - key: node.cilium.io/agent-not-ready
          operator: Exists
      volumes:
        # To read the configuration from the config map
      - name: cilium-config-path
        configMap:
          name: cilium-config
....

// Testing installation and verification 
#helm install cilium ./install/kubernetes/cilium
#kubectl describe deployment cilium-operator
Name:                   cilium-operator
Namespace:              default
CreationTimestamp:      Thu, 19 Jun 2025 16:58:38 -0700
Labels:                 app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=cilium-operator
                        app.kubernetes.io/part-of=cilium
                        io.cilium/app=operator
                        name=cilium-operator
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: cilium
                        meta.helm.sh/release-namespace: default
Selector:               io.cilium/app=operator,name=cilium-operator
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  50% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/name=cilium-operator
                    app.kubernetes.io/part-of=cilium
                    io.cilium/app=operator
                    name=cilium-operator
  Annotations:      prometheus.io/port: 9963
                    prometheus.io/scrape: true
  Service Account:  cilium-operator
  Containers:
   cilium-operator:
    Image:      quay.io/cilium/operator-generic-ci:latest
    Port:       9963/TCP
    Host Port:  9963/TCP
    Command:
      cilium-operator-generic
    Args:
      --config-dir=/tmp/cilium/config-map
      --debug=$(CILIUM_DEBUG)
    Liveness:   http-get http://127.0.0.1:9234/healthz delay=60s timeout=3s period=10s #success=1 #failure=3
    Readiness:  http-get http://127.0.0.1:9234/healthz delay=0s timeout=3s period=5s #success=1 #failure=5
    Environment:
      K8S_NODE_NAME:          (v1:spec.nodeName)
      CILIUM_K8S_NAMESPACE:   (v1:metadata.namespace)
      CILIUM_DEBUG:          <set to the key 'debug' of config map 'cilium-config'>  Optional: true
    Mounts:
      /tmp/cilium/config-map from cilium-config-path (ro)
  Volumes:
   cilium-config-path:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               cilium-config
    Optional:           false
  Priority Class Name:  system-cluster-critical
  Node-Selectors:       kubernetes.io/os=linux
  Tolerations:          node-role.kubernetes.io/control-plane op=Exists
                        node-role.kubernetes.io/master op=Exists
                        node.cilium.io/agent-not-ready op=Exists
                        node.kubernetes.io/not-ready op=Exists
Conditions:
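
The manual check of the `Tolerations:` field above could also be scripted. The following is a rough Python sketch that parses the text of `kubectl describe deployment` and compares the toleration keys against an expected set. The function name, the `EXPECTED_KEYS` set, and the assumption that each toleration line has the form `<key> op=<Operator>` are illustrative only, based on the output shown in this PR rather than on any documented format guarantee.

```python
from typing import List, Tuple

# Keys seen in the describe output of this PR; treated here as the expected set.
EXPECTED_KEYS = {
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",
    "node.cilium.io/agent-not-ready",
    "node.kubernetes.io/not-ready",
}


def parse_tolerations(describe_output: str) -> List[Tuple[str, str]]:
    """Extract (key, operator) pairs from the 'Tolerations:' block."""
    pairs = []
    in_block = False
    for line in describe_output.splitlines():
        text = line.strip()
        if text.startswith("Tolerations:"):
            in_block = True
            text = text[len("Tolerations:"):].strip()
        elif in_block and (not text or text.split()[0].endswith(":")):
            # A blank line or the next "Label:" field ends the block.
            break
        if in_block and text:
            key, _, rest = text.partition(" ")
            tokens = rest.split()
            op = tokens[0] if tokens else ""
            if op.startswith("op="):
                op = op[len("op="):]
            pairs.append((key, op))
    return pairs


SAMPLE = """\
  Node-Selectors:       kubernetes.io/os=linux
  Tolerations:          node-role.kubernetes.io/control-plane op=Exists
                        node-role.kubernetes.io/master op=Exists
                        node.cilium.io/agent-not-ready op=Exists
                        node.kubernetes.io/not-ready op=Exists
Conditions:
"""

if __name__ == "__main__":
    found = {key for key, _ in parse_tolerations(SAMPLE)}
    print("missing:", sorted(EXPECTED_KEYS - found))
    print("unexpected:", sorted(found - EXPECTED_KEYS))
```

In practice the sample string would be replaced by the captured output of the `kubectl describe deployment cilium-operator` command shown above.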

Fixes: #28549

Helm: Adding tolerations to block cilium-operator deployment into drained nodes

40137

[ upstream commit 589164b ]

The cilium-operator pod can be automatically rescheduled onto a drained node.
This can block some Kubernetes upgrades.

Signed-off-by: Murat Parlakisik <parlakisik@gmail.com>
@parlakisik parlakisik requested a review from a team as a code owner July 10, 2025 18:44
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.18 This PR represents a backport for Cilium 1.18.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Jul 10, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jul 10, 2025
@joestringer joestringer left a comment

LGTM thanks.

@joestringer joestringer changed the title Adding taint to block deployment to drained nodes Change the default taints that Cilium tolerates to avoid deploying to a drained node Jul 10, 2025
@joestringer
I changed the title to try to highlight the user impact more clearly in the release notes that will be generated.

@joestringer joestringer enabled auto-merge July 10, 2025 18:56
@joestringer
/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 10, 2025
@joestringer joestringer added this pull request to the merge queue Jul 10, 2025
Merged via the queue into cilium:v1.18 with commit a4c830d Jul 10, 2025
68 of 69 checks passed
guettli added a commit to guettli/cilium that referenced this pull request Aug 13, 2025
When the kubelet is started with --cloud-provider=external, new Nodes get that taint. The CCM picks up these new Nodes and then sets the ProviderID. But before the CCM can start on the first control-plane node of a new cluster, the CNI must be running. This means the Cilium Operator needs a toleration for that taint.

Related: https://app.slack.com/client/T1MATJ4SZ/C53TG4J4R

In Cilium v1.17 the Cilium Operator had a toleration for all taints. This was changed in that PR: cilium#40475
This PR extends the list of tolerations.

Fixes: aa9a24c (Change the default taints that Cilium tolerates to avoid deploying to a drained node)

Signed-off-by: Thomas Guettler <thomas.guettler@syself.com>
github-merge-queue bot pushed a commit that referenced this pull request Aug 13, 2025
rabelmervin pushed a commit to rabelmervin/cilium that referenced this pull request Aug 18, 2025
joamaki pushed a commit that referenced this pull request Aug 19, 2025
[ upstream commit 765ee79 ]

When the kubelet is started with --cloud-provider=external, new Nodes get that taint. The CCM picks up these new Nodes and then sets the ProviderID. But before the CCM can start on the first control-plane node of a new cluster, the CNI must be running. This means the Cilium Operator needs a toleration for that taint.

Related: https://app.slack.com/client/T1MATJ4SZ/C53TG4J4R

In Cilium v1.17 the Cilium Operator had a toleration for all taints. This was changed in that PR: #40475
This PR extends the list of tolerations.

Fixes: aa9a24c (Change the default taints that Cilium tolerates to avoid deploying to a drained node)

Signed-off-by: Thomas Guettler <thomas.guettler@syself.com>
Signed-off-by: Jussi Maki <jussi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Aug 21, 2025