
Enabling "spec.api.loadBalancer.useForInternalApi" requires access to kops controller port through API load balancer #10139

@seh

Description


1. What kops version are you running? The command kops version will display
this information.

Version 1.19.0-alpha.5 (git-921f3e7109a4871ec1cb0b0e512785533a24dbce)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.19.3

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

  1. kops edit cluster
  2. Set the Cluster manifest's "spec.api.loadBalancer.useForInternalApi" field to true in an earlier version of kops, such as version 1.18.2.
  3. Use kops update cluster and kops rolling-update cluster to update the DNS record in Route 53 and get all the Kubernetes cluster's nodes to use it (these early steps are condensed in the sketch after this list).
  4. Upgrade the master machines to kops version 1.19.0-alpha.5.
  5. Create a new EC2 instance, intending for it to join the Kubernetes cluster as a worker node (not a master node).
  6. Observe that the kops-configuration systemd service (or the nodeup program) can't finish its work, because it can't connect to the kops controller on the masters on TCP port 3988.
    The error message looks like this:
    error running task "BootstrapClient/BootstrapClient" (3m27s remaining to succeed): Post "https://api.internal.redacted-cluster-name:3988/bootstrap": dial tcp 3.20.73.237:3988: connect: connection timed out
  7. Observe that the new machine can't join the cluster as a node.
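
For concreteness, the first few steps above amount to roughly the following commands (a sketch only; the cluster name, state store, and flags are illustrative rather than my exact invocations):

  export KOPS_STATE_STORE=s3://redacted-kops-state

  # Step 2: set spec.api.loadBalancer.useForInternalApi to true (here with kops 1.18.2).
  kops edit cluster redacted-cluster-name

  # Step 3: push the change out and roll the nodes so they pick up the updated DNS record.
  kops update cluster redacted-cluster-name --yes
  kops rolling-update cluster redacted-cluster-name --yes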

5. What happened after the commands executed?

kops updated its machine bootstrapping scripts to attempt to connect to any of the master machines by way of the "internal" API server DNS record on port 3988. However, our API server load balancer (for now, a Classic ELB) has only one listener on port 443. It rejects inbound connections on port 3988.

Beyond that, even if you add another ELB listener on TCP port 3988, forwarding to port 3988 on the target instances, you then run into the firewalls defined by our security group rules. The masters accept inbound traffic on port 3988 from the worker nodes in the cluster, but not from the load balancer.

I added a rule to the security group used by the master machines to allow traffic in from the load balancer on port 3988, but that still wasn't enough. The bootstrap procedure on the new worker machine continued trying to connect to the masters on port 3988 through the load balancer, but it never succeeded.
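
For reference, the manual changes described above amount to something like the following (a sketch with placeholder names and security group IDs, not my exact commands):

  # Add a TCP listener on port 3988 to the Classic ELB fronting the API servers
  # (the load balancer name here is a placeholder).
  aws elb create-load-balancer-listeners \
    --load-balancer-name api-redacted-cluster-name \
    --listeners "Protocol=TCP,LoadBalancerPort=3988,InstanceProtocol=TCP,InstancePort=3988"

  # Allow traffic from the load balancer's security group to reach the masters on port 3988
  # (both group IDs are placeholders).
  aws ec2 authorize-security-group-ingress \
    --group-id sg-masters-placeholder \
    --protocol tcp \
    --port 3988 \
    --source-group sg-api-elb-placeholder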

6. What did you expect to happen?

I expected that setting the Cluster manifest's "spec.api.loadBalancer.useForInternalApi" field to true would either induce kops to adjust the API server load balancer's listeners and related security group rules to allow this traffic through or, better yet, lead kops to use a different DNS record for this internal traffic, on ports other than 443 (or whichever port we choose to serve the Kubernetes API).

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

Cluster manifest
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2020-10-29T20:45:51Z"
  name: redacted-cluster-name
spec:
  additionalSans:
  - api.internal.redacted-internal-domain
  api:
    loadBalancer:
      additionalSecurityGroups:
      - sg-0ef9de1ab14bee565
      crossZoneLoadBalancing: true
      idleTimeoutSeconds: 3600
      type: Public
      useForInternalApi: true
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    disableSecurityGroupIngress: true
  cloudProvider: aws
  configBase: s3://redacted-kops-state/redacted-cluster-name
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    - instanceGroup: master-us-east-2b
      name: b
    - instanceGroup: master-us-east-2c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    featureGates:
      EphemeralContainers: "true"
  kubeProxy:
    proxyMode: ipvs
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    featureGates:
      EphemeralContainers: "true"
  kubernetesVersion: 1.19.3
  masterInternalName: api.internal.redacted-cluster-name
  masterPublicName: api.redacted-cluster-name
  metricsServer:
    enabled: true
  networkCIDR: 10.3.0.0/16
  networkID: vpc-077913405bb66aac2
  networking:
    calico:
      crossSubnet: true
      typhaReplicas: 3
  nonMasqueradeCIDR: 100.64.0.0/10
  subnets:
  - cidr: 10.3.100.0/22
    id: subnet-081bbc33a2014ac94
    name: utility-us-east-2a
    type: Utility
    zone: us-east-2a
  - cidr: 10.3.104.0/22
    id: subnet-028e96931b1a7e0db
    name: utility-us-east-2b
    type: Utility
    zone: us-east-2b
  - cidr: 10.3.108.0/22
    id: subnet-07d72675b6de4221d
    name: utility-us-east-2c
    type: Utility
    zone: us-east-2c
  - cidr: 10.3.0.0/22
    egress: nat-003b528448366f901
    id: subnet-08e5f928a1aa92d13
    name: us-east-2a
    type: Private
    zone: us-east-2a
  - cidr: 10.3.4.0/22
    egress: nat-04cbc0b316dc7407f
    id: subnet-0ba5ffa2b5a4383a4
    name: us-east-2b
    type: Private
    zone: us-east-2b
  - cidr: 10.3.8.0/22
    egress: nat-017dff05abb092a02
    id: subnet-0b3b58fecd5dd7373
    name: us-east-2c
    type: Private
    zone: us-east-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or into a gist and provide the gist link here.

Beyond the error message above found in the journalctl output, I don't have anything else to share here.

9. Anything else we need to know?

There are some serious gymnastics I have to go through to roll out the change of the "spec.api.loadBalancer.useForInternalApi" field from false to true. It involves generating the Terraform configuration, importing the existing "internal" DNS record as a Terraform resource, tainting that new resource, and applying the configuration. Then, while kops rolling-update cluster is running, I have to run terraform apply -target <internal dns record address> repeatedly in the background until kops rolling-update cluster finishes, because kops keeps updating the DNS record with the new IP addresses of the master servers as they arrive.

I can share the shell code fragments for this if you want more detail. Though I've gotten this to work, it's fragile during the upgrade. I think this feature may be too hard to use, especially if not enabled at cluster creation time.
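
In rough outline, the sequence looks like this (a sketch only; the Terraform resource address, hosted zone ID, and exact flags are placeholders reconstructed from memory, not the literal fragments from my scripts):

  # Generate the Terraform configuration from the kops state.
  kops update cluster redacted-cluster-name --target terraform --out .

  # Adopt the existing "internal" Route 53 record as a Terraform resource, then taint it
  # so the next apply rewrites it.
  terraform import aws_route53_record.api-internal-redacted-cluster-name \
    HOSTEDZONEID_api.internal.redacted-cluster-name_A
  terraform taint aws_route53_record.api-internal-redacted-cluster-name
  terraform apply

  # While kops rolling-update cluster --yes runs in another shell, keep reasserting the
  # record, because kops keeps rewriting it with the new masters' IP addresses as they arrive.
  while pgrep -f "kops rolling-update cluster" > /dev/null; do
    terraform apply -auto-approve -target aws_route53_record.api-internal-redacted-cluster-name
    sleep 30
  done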
