Skip to content

Conversation

guettli
Copy link
Contributor

@guettli guettli commented Aug 13, 2025

When Kubelet gets started with --cloud-provider=external, then new Nodes get the Taint 'node.cloudprovider.kubernetes.io/uninitialized'. CCM picks these new nodes and sets then ProviderID. But before CCM can start on the first control-plane of a new cluster, the CNI must be running. This means Cilium Operator needs a toleration for that taint. Otherwise, the first node (control-plane) in a cluster fails to start.

Related: https://app.slack.com/client/T1MATJ4SZ/C53TG4J4R

In Cilium v1.17 the Cilium Operator had a toleration for all taints. This was changed in that PR: #40475 This PR extends the list of tolerations.

Fixes: aa9a24c (Change the default taints that Cilium tolerates to avoid deploying to a drained node)

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request

  • All code is covered by unit and/or runtime tests where feasible.

Yes, a test with --cloud-provider=external (Kubelet argument) would be good. But not doable for me at the moment. I can assist if you want to. Maybe this could be done with Cluster-API and InfraProvider Docker.

  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.

Needed?

  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.

I tried, but I guess I do not have permission to set a reviewer. I wrote him via Slack.

Add toleration for 'node.cloudprovider.kubernetes.io/uninitialized' to Cilium Operator

@guettli guettli requested review from a team as code owners August 13, 2025 06:02
@guettli guettli requested review from tommyp1ckles and gandro August 13, 2025 06:02
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 13, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Aug 13, 2025
@guettli guettli force-pushed the add-toleration--node.cloudprovider.kubernetes.io branch from 7e856c2 to c9ebcae Compare August 13, 2025 06:06
@guettli
Copy link
Contributor Author

guettli commented Aug 13, 2025

There was a network issue:

Error: signing [quay.io/cilium/operator-generic-ci@sha256:a476690aa629d50e893f8236ff570dabb5a07558c79dfb1aa6ffa944ab7adf99]: getting signer: getting key from Fulcio: getting CTFE public keys: updating local metadata and targets: error updating to TUF remote mirror: tuf: failed to download 13.root.json: Get "https://tuf-repo-cdn.sigstore.dev/13.root.json": dial tcp 34.117.62.14:443: i/o timeout

Unfortunately I see no way to run the test again.

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR and context! Looks good to me!

@gandro
Copy link
Member

gandro commented Aug 13, 2025

Unfortunately I see no way to run the test again.

Only organization members can run/restart CI. I'll restart CI for you!

@gandro gandro added area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. release-note/bug This PR fixes an issue in a previous release of Cilium. area/helm Impacts helm charts and user deployment experience needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 13, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 13, 2025
@gandro
Copy link
Member

gandro commented Aug 13, 2025

I think you also have to update the auto-generated Helm value documentation. Run the following commands to achieve that:

make -C install/kubernetes && make -C Documentation update-helm-values 

checkpatch is also complaining about the long subject in the commit message, but this is an optional workflow an can be ignored. The subject is fine as is.

@parlakisik
Copy link
Contributor

  • please update documentation for this change too
  • did you run make -C install/kubernetesfor values.yaml . This should not be edited manually

@guettli guettli force-pushed the add-toleration--node.cloudprovider.kubernetes.io branch from c9ebcae to 4df5566 Compare August 13, 2025 06:57
When Kubelet gets started with --cloud-provider=external, then new Nodes get that annotation. CCM picks these new nodes and sets then ProviderID. But before CCM can start on the first control-plane of a new cluster, the CNI must be running. This means Cilum Operator needs a toleration for that taint.

Related: https://app.slack.com/client/T1MATJ4SZ/C53TG4J4R

In Cilium v1.17 the Cilium Operator had a toleration for all taints. This was changed in that PR: cilium#40475
This PR extends the list of tolerations.

Fixes: aa9a24c (Change the default taints that Cilium tolerates to avoid deploying to a drained node)

Signed-off-by: Thomas Guettler <thomas.guettler@syself.com>
@guettli guettli force-pushed the add-toleration--node.cloudprovider.kubernetes.io branch from 4df5566 to 39f4ff7 Compare August 13, 2025 06:58
@guettli
Copy link
Contributor Author

guettli commented Aug 13, 2025

I think you also have to update the auto-generated Helm value documentation. Run the following commands to achieve that:

make -C install/kubernetes && make -C Documentation update-helm-values 

Thank you for the hint. I only run make update-helm-values up to now.

I pushed after running your command. Please re-check

checkpatch is also complaining about the long subject in the commit message, but this is an optional workflow an can be ignored. The subject is fine as is.

I trimmed the git subject line.

@guettli
Copy link
Contributor Author

guettli commented Aug 13, 2025

@parlakisik @gandro The generation of the values.schema.json looks strange. There is anyOf, but it contains the same data again and again:

        "tolerations": {
          "items": {
            "anyOf": [
              {
                "properties": {
                  "key": {
                    "type": "string"
                  },
                  "operator": {
                    "type": "string"
                  }
                }
              },
              {
                "properties": {
                  "key": {
                    "type": "string"
                  },
                  "operator": {
                    "type": "string"
                  }
                }
              },
              {
                "properties": {
                  "key": {
                    "type": "string"
                  },
                  "operator": {
                    "type": "string"
                  }
                }
              },
              {
                "properties": {
                  "key": {
                    "type": "string"
                  },
                  "operator": {
                    "type": "string"
                  }
                }
              }
            ]

@guettli guettli requested a review from gandro August 13, 2025 07:02
@parlakisik
Copy link
Contributor

I think there is a bug in the schema generation. That we should investigate. that is normal for now .

@parlakisik
Copy link
Contributor

parlakisik commented Aug 13, 2025

Looks good to me!

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Yeah, both our helm docs and schema generators sometimes do strange things. We've noticed the strange schema in earlier PRs as well, but it doesn't prevent anyone from setting custom tolerations (I've tested that).

@gandro
Copy link
Member

gandro commented Aug 13, 2025

/test

@gandro gandro enabled auto-merge August 13, 2025 07:22
@guettli
Copy link
Contributor Author

guettli commented Aug 13, 2025

Thanks!

Yeah, both our helm docs and schema generators sometimes do strange things. We've noticed the strange schema in earlier PRs as well, but it doesn't prevent anyone from setting custom tolerations (I've tested that).

I created an issue. Maybe someone wants to fix that: values.schema.json contain duplicate data · Issue #41099 · cilium/cilium

@gandro gandro added this pull request to the merge queue Aug 13, 2025
Merged via the queue into cilium:main with commit 765ee79 Aug 13, 2025
78 of 83 checks passed
@joamaki joamaki mentioned this pull request Aug 19, 2025
19 tasks
@joamaki joamaki added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 19, 2025
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Aug 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/helm Impacts helm charts and user deployment experience area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. kind/community-contribution This was a contribution made by a community member. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants