kops-controller can't determine AWS region #9856

@w3irdrobot

Description

1. What kops version are you running? The command kops version will display
this information.

➜ kops version
Version 1.18.0 (git-698bf974d8)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

➜ kubectl version --short
Client Version: v1.19.0
Server Version: v1.18.8

3. What cloud provider are you using?

AWS GovCloud

4. What commands did you run? What is the simplest way to reproduce this issue?

This was done as part of an upgrade from Kubernetes v1.17.10 created using kops v1.17.1 to Kubernetes v1.18.8 using kops v1.18.0. We upgraded the cluster and removed the pinned Docker version (originally added to fix Docker issues on Amazon Linux). So it went something like the following:

kops upgrade cluster --yes
kops edit cluster # removed the docker stuff
kops update cluster --yes
kops rolling-update cluster --yes

5. What happened after the commands executed?

Everything came back up smoothly, except that the role labels weren't being added to the worker nodes. Looking into the kops-controller logs, we saw errors like this:

I0901 17:32:15.966765       1 s3context.go:325] unable to read /sys/devices/virtual/dmi/id/product_uuid, assuming not running on EC2: open /sys/devices/virtual/dmi/id/product_uuid: permission denied
I0901 17:32:15.967033       1 s3context.go:170] defaulting region to "us-east-1"
I0901 17:32:16.311839       1 s3context.go:191] unable to get bucket location from region "us-east-1"; scanning all regions: InvalidToken: The provided token is malformed or otherwise invalid.
     status code: 400, request id: 4Z4P1MCS0S6K8P0R, host id: HOST-ID
E0901 17:32:16.647080       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="unable to load cluster object for node ip-172-20-40-221.us-gov-west-1.compute.internal: error loading Cluster \"s3://ciphermorph-com-k8s-local-kops/ciphermorph.com.k8s.local/cluster.spec\": Unable to list AWS regions: AuthFailure: AWS was not able to validate the provided access credentials\n\tstatus code: 401, request id: 870266f7-a2ac-417e-9fb3-e7496e5ba629"  "controller"="node" "request"={"Namespace":"","Name":"ip-172-20-40-221.us-gov-west-1.compute.internal"}

6. What did you expect to happen?

I expected the kops-controller to be able to determine the region I'm running in by reading that file, rather than erroring out. The kops-controller would then be able to proceed and add the labels to the node.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-04-11T19:12:40Z"
  generation: 7
  name: our.cluster.k8s.local
spec:
  additionalPolicies:
    node: |
      [
        {"Effect":"Allow","Action":["autoscaling:DescribeAutoScalingGroups","autoscaling:DescribeAutoScalingInstances","autoscaling:DescribeLaunchConfigurations","autoscaling:DescribeTags","autoscaling:SetDesiredCapacity","autoscaling:TerminateInstanceInAutoScalingGroup"],"Resource":"*"}
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://our-bucket/our.cluster.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-gov-west-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-gov-west-1a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.18.8
  masterInternalName: api.internal.our.cluster.k8s.local
  masterPublicName: api.our.cluster.k8s.local
  networkCIDR: 172.172.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 127.0.0.1/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.172.172.0/19
    name: us-gov-west-1a
    type: Public
    zone: us-gov-west-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

This is running in AWS GovCloud region us-gov-west-1.

Digging into the code, I found the place where this check is happening. By setting the AWS_REGION environment variable on the DaemonSet, I was able to get things working. However, this isn't a long-term solution, since I assume it will get overwritten the next time we upgrade.
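
For reference, the temporary workaround was roughly the following (a sketch: it assumes the default kops-controller DaemonSet name in the kube-system namespace, and kops update cluster will presumably revert it):

kubectl -n kube-system set env daemonset/kops-controller AWS_REGION=us-gov-west-1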

I looked at /sys/devices/virtual/dmi/id/product_uuid directly on the node, since it is the file the controller can't open, and saw that it is owned by root. Initially I thought that maybe Amazon Linux makes root own it while other distros don't, but that turned out to be incorrect: I spun up an Ubuntu instance, and it too had that file owned by root.
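
The ownership check looked something like this on both distros (the timestamp is illustrative; the root-only 0400 mode is the relevant part):

ls -l /sys/devices/virtual/dmi/id/product_uuid
-r-------- 1 root root 4096 Sep  1 17:30 /sys/devices/virtual/dmi/id/product_uuid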

I also tried updating to the most recent AMI of Amazon Linux, but that didn't fix anything either.
