Description
1. What kops version are you running? The command kops version will display this information.
➜ kops version
Version 1.18.0 (git-698bf974d8)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
➜ kubectl version --short
Client Version: v1.19.0
Server Version: v1.18.8
3. What cloud provider are you using?
AWS GovCloud
4. What commands did you run? What is the simplest way to reproduce this issue?
This was done as part of an upgrade from Kubernetes v1.17.10 created using kops v1.17.1 to Kubernetes v1.18.8 using kops v1.18.0. We upgraded the cluster and removed the pinned Docker version (originally added to fix Docker issues on Amazon Linux). So it went something like the following:
kops upgrade cluster --yes
kops edit cluster # removed the docker stuff
kops update cluster --yes
kops rolling-update cluster --yes
5. What happened after the commands executed?
Everything came up again smoothly, except the role labels weren't being added to the worker nodes. Looking into the kops-controller logs, we were getting errors like this:
I0901 17:32:15.966765 1 s3context.go:325] unable to read /sys/devices/virtual/dmi/id/product_uuid, assuming not running on EC2: open /sys/devices/virtual/dmi/id/product_uuid: permission denied
I0901 17:32:15.967033 1 s3context.go:170] defaulting region to "us-east-1"
I0901 17:32:16.311839 1 s3context.go:191] unable to get bucket location from region "us-east-1"; scanning all regions: InvalidToken: The provided token is malformed or otherwise invalid.
status code: 400, request id: 4Z4P1MCS0S6K8P0R, host id: HOST-ID
E0901 17:32:16.647080 1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="unable to load cluster object for node ip-172-20-40-221.us-gov-west-1.compute.internal: error loading Cluster \"s3://ciphermorph-com-k8s-local-kops/ciphermorph.com.k8s.local/cluster.spec\": Unable to list AWS regions: AuthFailure: AWS was not able to validate the provided access credentials\n\tstatus code: 401, request id: 870266f7-a2ac-417e-9fb3-e7496e5ba629" "controller"="node" "request"={"Namespace":"","Name":"ip-172-20-40-221.us-gov-west-1.compute.internal"}
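Read together, these log lines suggest a fallback chain: the product_uuid read fails, the controller concludes it is not on EC2, it defaults to us-east-1, and the commercial-partition endpoint then rejects the GovCloud credentials. The following is only a rough shell paraphrase of that chain, not kops code. It assumes (based on the workaround described under question 9) that an AWS_REGION environment variable is consulted first; guess_region, its path argument, and the "instance-metadata" placeholder are hypothetical names for illustration.

```shell
#!/bin/sh
# Hypothetical paraphrase of the region fallback implied by the logs.
# Usage: guess_region [PATH_TO_PRODUCT_UUID]
guess_region() {
    uuid_file="${1:-/sys/devices/virtual/dmi/id/product_uuid}"

    # Assumption: an explicitly set AWS_REGION short-circuits detection.
    if [ -n "${AWS_REGION:-}" ]; then
        echo "$AWS_REGION"
        return 0
    fi

    # Simplification: a readable file means detection could continue
    # and the region could be taken from instance metadata.
    if [ -r "$uuid_file" ]; then
        echo "instance-metadata"
        return 0
    fi

    # Unreadable file: treated as "not on EC2", so the default applies.
    echo "us-east-1"
}
```

Under these assumptions, a permission-denied read in us-gov-west-1 ends in the us-east-1 branch, which matches the "defaulting region" log line.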
6. What did you expect to happen?
I expected the kops-controller to be able to determine where it is running by reading that file, without erroring out. The kops-controller would then be able to proceed and add the labels to the node.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-04-11T19:12:40Z"
  generation: 7
  name: our.cluster.k8s.local
spec:
  additionalPolicies:
    node: |
      [
        {"Effect":"Allow","Action":["autoscaling:DescribeAutoScalingGroups","autoscaling:DescribeAutoScalingInstances","autoscaling:DescribeLaunchConfigurations","autoscaling:DescribeTags","autoscaling:SetDesiredCapacity","autoscaling:TerminateInstanceInAutoScalingGroup"],"Resource":"*"}
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://our-bucket/our.cluster.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-gov-west-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-gov-west-1a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.18.8
  masterInternalName: api.internal.our.cluster.k8s.local
  masterPublicName: api.our.cluster.k8s.local
  networkCIDR: 172.172.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 127.0.0.1/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.172.172.0/19
    name: us-gov-west-1a
    type: Public
    zone: us-gov-west-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
This is running in AWS GovCloud region us-gov-west-1.
Digging into the code, I found the place where this check is happening. By setting the AWS_REGION environment variable in the DaemonSet, I was able to get things working. However, this won't be a long-term solution, since I assume the change will get overwritten when we do another upgrade.
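For anyone wanting to try the same workaround, it can be sketched as a single kubectl call. This is an illustration under assumptions, not a confirmed fix: the DaemonSet name kops-controller and the kube-system namespace are assumed here, set_controller_region is a hypothetical helper, and the KUBECTL override exists only so the command can be previewed without touching a cluster.

```shell
#!/bin/sh
# Sketch of the AWS_REGION workaround as a one-off kubectl call.
# Assumptions: the DaemonSet is named kops-controller and lives in kube-system.
# Set KUBECTL=echo to print the arguments instead of running kubectl.
set_controller_region() {
    region="$1"
    "${KUBECTL:-kubectl}" --namespace kube-system \
        set env daemonset/kops-controller "AWS_REGION=${region}"
}
```

Calling set_controller_region us-gov-west-1 would patch the pod template and trigger a rollout of the controller pods. As noted above, a manual patch like this is likely to be reverted the next time kops reapplies its managed manifests.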
I looked at /sys/devices/virtual/dmi/id/product_uuid directly on the node, since it is the file the controller can't open, and saw that it is owned by root. Initially I thought Amazon Linux might be unusual in making root own it, but that is incorrect: I spun up an Ubuntu instance, and the file was owned by root there too.
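Since root ownership turned out to be the same on both distros, a more telling comparison may be the file's mode bits and whether the user doing the read can actually open it. A small sketch for checking that on a node follows; check_readable is a hypothetical helper, it relies on GNU stat, and the result only reflects the user it is run as, which may differ from the user inside the controller container.

```shell
#!/bin/sh
# Print owner, group, and octal mode of a file, then report whether
# the current user can read it. Requires GNU stat (the -c option).
check_readable() {
    file="$1"
    stat -c 'owner=%U group=%G mode=%a' "$file"
    if [ -r "$file" ]; then
        echo "readable by uid $(id -u)"
    else
        echo "NOT readable by uid $(id -u)"
    fi
}
```

Running check_readable /sys/devices/virtual/dmi/id/product_uuid once as root and once as an unprivileged user would show whether the permission-denied error depends on the UID of the reader.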
I also tried updating to the most recent AMI of Amazon Linux, but that didn't fix anything either.