
[BUG] Managed RKE2 clusters are broken after upgrade to 2.9.1 when KDM is not updated to release-v2.9 #46855

@riuvshyn

Description

Rancher Server Setup

  • Rancher version: 2.9.1
  • Installation option (Docker install/Helm Chart): Helm chart
  • If Helm Chart, Kubernetes cluster and version: RKE2 1.28.12

Information about the Cluster

  • Kubernetes version:
  • Cluster Type (Local/Downstream): downstream RKE2 custom clusters

User Information

  • What is the role of the user logged in?: Admin

Describe the bug
After upgrading Rancher from 2.8.3 to 2.9.1, managed RKE2 custom clusters provisioned with rancher2_cluster_v2 are broken: the configuration provided in /etc/rancher/rke2/config.yaml.d/50-rancher.yaml is completely ignored and appears to have been replaced with default parameters.

To Reproduce

  1. Provision a managed RKE2 1.28.x custom cluster with the Terraform rancher2_cluster_v2 resource on Rancher 2.8.3 (see the sketch below)
  2. Upgrade Rancher to the latest 2.9.1
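
For reference, a minimal sketch of the kind of rancher2_cluster_v2 definition used in step 1, assuming the CNI and service CIDR are set through rke_config.machine_global_config. The resource name, the exact RKE2 version string, and the 100.64.0.0/16 service CIDR are illustrative assumptions (the CIDR is only inferred from the certificate errors below), not the actual cluster configuration:

resource "rancher2_cluster_v2" "rke2_custom" {
  name               = "rke2-custom"        # hypothetical name
  kubernetes_version = "v1.28.12+rke2r1"    # a 1.28.x RKE2 release, as in step 1

  rke_config {
    # Rancher renders this block (along with other managed settings) into
    # /etc/rancher/rke2/config.yaml.d/50-rancher.yaml on the nodes; after the
    # 2.9.1 upgrade these values disappear from that file.
    machine_global_config = <<-EOF
      cni: cilium                  # custom CNI that is dropped after the upgrade
      service-cidr: 100.64.0.0/16  # assumed custom service CIDR, see x509 errors below
    EOF
  }
}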

Result
After the upgrade, the Rancher agent pods in the cattle-system namespace on the managed cluster are replaced with new v2.9.1 pods, and shortly afterwards /etc/rancher/rke2/config.yaml.d/50-rancher.yaml is rewritten down to just a few lines. This leaves the cluster completely broken because the configured Cilium CNI and other critical components are removed.

Expected Result
Upgrading Rancher to 2.9.1 does not break managed RKE2 clusters.

Additional context
I have noticed that after the upgrade some cluster-critical components, such as Kyverno, started failing with errors like this:

{"level":"error","ts":1724780897.1700578,"logger":"klog","caller":"leaderelection/leaderelection.go:332","msg":"error retrieving resource lock kyverno/kyverno: Get \"https://100.64.0.1:443/apis/coordination.k8s.io/v1/namespaces/kyverno/leases/kyverno\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, ::1, 172.23.103.172, 172.23.103.172, 10.43.0.1, not 100.64.0.1","stacktrace":"k8s.io/client-go/tools/leaderelection.(*LeaderElector).tryAcquireOrRenew\n\tk8s.io/client-go@v0.29.0/tools/leaderelection/leaderelection.go:332\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).acquire.func1\n\tk8s.io/client-go@v0.29.0/tools/leaderelection/leaderelection.go:252\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:227\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:204\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).acquire\n\tk8s.io/client-go@v0.29.0/tools/leaderelection/leaderelection.go:251\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\tk8s.io/client-go@v0.29.0/tools/leaderelection/leaderelection.go:208\ngithub.com/kyverno/kyverno/pkg/leaderelection.(*config).Run\n\tgithub.com/kyverno/kyverno/pkg/leaderelection/leaderelection.go:136\nmain.main.func2\n\tgithub.com/kyverno/kyverno/cmd/kyverno/main.go:462"}

Pods in the cattle-system namespace fail in the same way:

time="2024-08-27T17:48:43Z" level=error msg="error syncing 'cattle-system/apply-system-agent-upgrader-on-ip-172-23-102-146-with-073-56173': handler system-upgrade-controller: Get \"https://100.64.0.1:443/apis/upgrade.cattle.io/v1/namespaces/cattle-system/plans/system-agent-upgrader\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, ::1, 172.23.102.15, 172.23.102.15, 10.43.0.1, not 100.64.0.1, requeuing"

Metadata

Labels

kind/bug — Issues that are defects reported by users or that we know have reached a real release
team/hostbusters — The team that is responsible for provisioning/managing downstream clusters + K8s version support
