Description
1. What kops version are you running? The command kops version will display this information.
Version 1.19.0-alpha.5 (git-921f3e7109a4871ec1cb0b0e512785533a24dbce)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
1.19.3
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
- kops edit cluster
- Set the Cluster manifest's "spec.api.loadBalancer.useForInternalApi" field to true in an earlier version of kops, such as version 1.18.2.
- Use kops update cluster and kops rolling-update cluster to update the DNS record in Route 53 and get all the Kubernetes cluster's nodes to use it.
- Upgrade the master machines to kops version 1.19.0-alpha.5.
- Create a new EC2 instance, intending for it to join the Kubernetes cluster as a worker node (not a master node).
- Observe that the kops-configuration systemd service (or the nodeup program) can't finish its work, because it can't connect to the kops controller on the masters on TCP port 3988.
The error message looks like this:
error running task "BootstrapClient/BootstrapClient" (3m27s remaining to succeed): Post "https://api.internal.redacted-cluster-name:3988/bootstrap": dial tcp 3.20.73.237:3988: connect: connection timed out
- Observe that the new machine can't join the cluster as a node. (A rough sketch of these steps as commands follows this list.)
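For concreteness, the sequence looks roughly like the following. The worker instance group name and the exact way we launch the new instance are stand-ins for what we actually do, and rolling only the masters by listing their instance groups is just one way to express step 4:
# With kops 1.18.2:
kops edit cluster --name redacted-cluster-name      # set spec.api.loadBalancer.useForInternalApi: true
kops update cluster --name redacted-cluster-name --yes
kops rolling-update cluster --name redacted-cluster-name --yes
# Switch to kops 1.19.0-alpha.5 and roll the masters:
kops update cluster --name redacted-cluster-name --yes
kops rolling-update cluster --name redacted-cluster-name \
  --instance-group master-us-east-2a,master-us-east-2b,master-us-east-2c --yes
# Bring up a fresh worker (here by growing an instance group; "nodes-us-east-2a" is a placeholder):
kops edit ig nodes-us-east-2a --name redacted-cluster-name   # bump minSize/maxSize
kops update cluster --name redacted-cluster-name --yes
# On the new machine, watch the bootstrap stall:
journalctl -u kops-configuration.service -f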
5. What happened after the commands executed?
kops updated its machine bootstrapping scripts to attempt to connect to any of the master machines by way of the "internal" API server DNS record on port 3988. However, our API server load balancer (for now, a Classic ELB) has only one listener on port 443. It rejects inbound connections on port 3988.
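For reference, this is roughly how to confirm that the ELB listens only on 443 (the load balancer name here is a placeholder for the one kops created):
aws elb describe-load-balancers \
  --load-balancer-names api-redacted-cluster-name \
  --query 'LoadBalancerDescriptions[0].ListenerDescriptions'
# Today this shows a single listener: TCP 443 forwarding to TCP 443 on the instances.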
Beyond that, even if you add another ELB listener on TCP port 3988, forwarding to port 3988 on the target instances, you then run into the firewalls defined by our security group rules. The masters accept inbound traffic on port 3988 from the worker nodes in the cluster, but not from the load balancer.
I added a rule to the security group used by the master machines to allow traffic in from the load balancer on port 3988, but that still wasn't enough. The bootstrap procedure on the new worker machine continued trying to connect to the masters on port 3988 through the load balancer, but it never succeeded.
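The experiment looked roughly like this; the load balancer name and the two security group IDs are placeholders for the masters' group and the group attached to the ELB:
# Add a second listener that forwards TCP 3988 to the instances.
aws elb create-load-balancer-listeners \
  --load-balancer-name api-redacted-cluster-name \
  --listeners Protocol=TCP,LoadBalancerPort=3988,InstanceProtocol=TCP,InstancePort=3988
# Allow the load balancer's security group to reach the masters on port 3988.
aws ec2 authorize-security-group-ingress \
  --group-id sg-masters-placeholder \
  --protocol tcp --port 3988 \
  --source-group sg-elb-placeholder
# Even after that, from the stuck worker the port never opens:
nc -vz -w 5 api.internal.redacted-cluster-name 3988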
6. What did you expect to happen?
I expected that setting the Cluster manifest's "spec.api.loadBalancer.useForInternalApi" field to true would induce kops either to adjust the API server load balancer's listeners and related security group rules to allow this traffic through or, better yet, to use a different DNS record for this internal traffic on ports other than 443 (or whichever port we choose to serve the Kubernetes API).
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
Cluster manifest
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-10-29T20:45:51Z"
  name: redacted-cluster-name
spec:
  additionalSans:
  - api.internal.redacted-internal-domain
  api:
    loadBalancer:
      additionalSecurityGroups:
      - sg-0ef9de1ab14bee565
      crossZoneLoadBalancing: true
      idleTimeoutSeconds: 3600
      type: Public
      useForInternalApi: true
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    disableSecurityGroupIngress: true
  cloudProvider: aws
  configBase: s3://redacted-kops-state/redacted-cluster-name
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    featureGates:
      EphemeralContainers: "true"
  kubeProxy:
    proxyMode: ipvs
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    featureGates:
      EphemeralContainers: "true"
  kubernetesVersion: 1.19.3
  masterInternalName: api.internal.redacted-cluster-name
  masterPublicName: api.redacted-cluster-name
  metricsServer:
    enabled: true
  networkCIDR: 10.3.0.0/16
  networkID: vpc-077913405bb66aac2
  networking:
    calico:
      crossSubnet: true
      typhaReplicas: 3
  nonMasqueradeCIDR: 100.64.0.0/10
  subnets:
  - cidr: 10.3.100.0/22
    id: subnet-081bbc33a2014ac94
    name: utility-us-east-2a
    type: Utility
    zone: us-east-2a
  - cidr: 10.3.104.0/22
    id: subnet-028e96931b1a7e0db
    name: utility-us-east-2b
    type: Utility
    zone: us-east-2b
  - cidr: 10.3.108.0/22
    id: subnet-07d72675b6de4221d
    name: utility-us-east-2c
    type: Utility
    zone: us-east-2c
  - cidr: 10.3.0.0/22
    egress: nat-003b528448366f901
    id: subnet-08e5f928a1aa92d13
    name: us-east-2a
    type: Private
    zone: us-east-2a
  - cidr: 10.3.4.0/22
    egress: nat-04cbc0b316dc7407f
    id: subnet-0ba5ffa2b5a4383a4
    name: us-east-2b
    type: Private
    zone: us-east-2b
  - cidr: 10.3.8.0/22
    egress: nat-017dff05abb092a02
    id: subnet-0b3b58fecd5dd7373
    name: us-east-2c
    type: Private
    zone: us-east-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
Beyond the error message above, taken from the journalctl output, I don't have anything else to share here.
9. Anything else we need to know?
There are some serious gymnastics I have to go through to roll out changing the "spec.api.loadBalancer.useForInternalApi" field from false to true. It involves generating the Terraform configuration, importing the existing "internal" DNS record as a Terraform resource, tainting that new resource, and applying the configuration. Then, while kops rolling-update cluster is running, in the background I have to run terraform apply -target <internal dns record address> repeatedly until kops rolling-update cluster finishes, because kops keeps updating the DNS record with the new masters' IP addresses as they arrive.
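A rough sketch of that sequence follows. The Route 53 zone ID and the Terraform resource address for the "internal" record are placeholders; the real address comes from the configuration that kops generates.
# Generate Terraform configuration rather than having kops apply changes directly.
kops update cluster --name redacted-cluster-name --target=terraform --out=.
# Adopt the existing "internal" record into Terraform state
# (placeholder resource address; the import ID format is ZONEID_NAME_TYPE).
terraform import aws_route53_record.api-internal ZPLACEHOLDER_api.internal.redacted-cluster-name_A
# Force Terraform to rewrite the record so it points at the load balancer.
terraform taint aws_route53_record.api-internal
terraform apply
# In the background, while kops rolling-update cluster runs, keep re-applying
# just that record, because kops keeps rewriting it with the new masters' IP
# addresses as they come up.
while true; do
  terraform apply -auto-approve -target=aws_route53_record.api-internal
  sleep 30
done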
I can share the full shell code fragments for this if you want more detail. Though I've gotten this to work, it's fragile during the upgrade. I think this feature may be too hard to use, especially if it's not enabled at cluster creation time.