
fleetagent running on downstream cluster cannot communicate back to Rancher server after restore #33954

@snasovich

Description


Found as part of working on #32599

Rancher Server Setup

  • Rancher version: master-head
  • Installation option (Docker install/Helm Chart): rancher/rancher code running locally against k3s cluster

Information about the Cluster

  • Kubernetes version: v1.19.8+k3s1 (upstream), v1.21.3+rke2r1 (downstream)

Describe the bug
Following restore of a backed-up cluster with an RKE2-provisioned downstream cluster, the fleetagent running on the downstream cluster fails to communicate back to the Rancher server, as evidenced by 401 responses to the requests it sends to the Rancher API, shown below:
[screenshot: fleetagent logs showing 401 responses from the Rancher API]

To Reproduce

  1. Provision a DigitalOcean RKE2 downstream cluster. The issue may not require an RKE2-provisioned cluster, but it was found with one.
  2. Using the "Rancher Backup" app, perform a backup of the cluster. Note that restore of RKE2-provisioned clusters is being worked on as part of RKE2 Provisioning: Backup - Support Rancher backups with RKE2 provisioned clusters #32599, and so far the following needs to be added to the Backup/Restore resourceset to back up the related objects:
   # Added for v2 provisioning
  - apiVersion: apiextensions.k8s.io/v1
    kindsRegexp: .
    resourceNameRegexp: provisioning.cattle.io$|rke-machine-config.cattle.io$|rke-machine.cattle.io$|rke.cattle.io$
  - apiVersion: provisioning.cattle.io/v1
    kindsRegexp: .
  - apiVersion: rke-machine-config.cattle.io/v1
    kindsRegexp: .
  - apiVersion: rke-machine.cattle.io/v1
    kindsRegexp: .
  - apiVersion: rke.cattle.io/v1
    kindsRegexp: .
  - apiVersion: apiextensions.k8s.io/v1
    kindsRegexp: .
    resourceNameRegexp: cluster.x-k8s.io$
  - apiVersion: cluster.x-k8s.io/v1alpha4
    kindsRegexp: .
  # The below will also back up the unnecessary default-token-... secret
  - apiVersion: v1
    kindsRegexp: ^secrets$
    namespaces:
    - fleet-default
  3. Follow the instructions to restore Rancher to a new cluster per https://rancher.com/docs/rancher/v2.x/en/backups/v2.5/migrating-rancher/ (ensure the source upstream cluster and Rancher are stopped).
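Concretely, the restore step above can be expressed as a Restore custom resource for the rancher-backup operator, applied on the new upstream cluster (a minimal sketch; the backup filename, bucket, region, and credential secret names below are placeholders, not values from this report):

```yaml
# Restore CR for the rancher-backup operator, applied on the NEW upstream cluster.
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-migration
spec:
  backupFilename: rancher-backup-example.tar.gz   # placeholder filename
  prune: false                                    # do not prune when migrating to a new cluster
  storageLocation:
    s3:
      credentialSecretName: s3-creds              # placeholder secret name
      credentialSecretNamespace: default
      bucketName: rancher-backups                 # placeholder bucket
      folder: ""
      region: us-east-1                           # placeholder region
      endpoint: s3.us-east-1.amazonaws.com        # placeholder endpoint
```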

Result
After starting Rancher against the restored cluster, requests sent by fleetagent receive 401 responses, as in the screenshot above. This indicates fleetagent is unable to communicate with the upstream Rancher server.

Expected Result
fleetagent communicates back to the upstream Rancher server without issues.

Additional context
The issue is almost certainly caused by fleetagent on the downstream cluster persisting the token from a secret associated with a service account on the upstream cluster. Following the restore, this token is no longer valid. Unfortunately, there doesn't seem to be a way to migrate service accounts to a new cluster such that tokens generated on the source cluster remain valid (https://stackoverflow.com/q/65580643/16564280).
One option to address the issue would be to somehow trigger fleetagent to re-acquire a token from the cluster on restore.
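A manual workaround along those lines might be to discard the agent's cached credentials and restart it so it re-registers against the restored Rancher server. This is only a sketch: the namespace and object names below are assumptions that vary by Fleet version (older releases use fleet-system rather than cattle-fleet-system), and it presumes re-registration re-issues a valid token.

```shell
# On the DOWNSTREAM cluster: delete the secret holding the fleet agent's
# stale kubeconfig/token, then restart the agent deployment so it
# re-registers with the restored upstream Rancher server.
# Namespace and names are assumptions; verify against your Fleet version.
kubectl -n cattle-fleet-system delete secret fleet-agent
kubectl -n cattle-fleet-system rollout restart deployment fleet-agent
```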
