
Conversation

johngmyers
Member

Fixes #10139

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 30, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johngmyers

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/provider/aws Issues or PRs related to aws provider approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 30, 2020
@johngmyers
Member Author

/retest

@seh
Contributor

seh commented Oct 30, 2020

I built kops including this patch, and it worked as intended: my worker machines were able to contact kops-controller on port 3988 and finish their bootstrap procedure.

@johngmyers johngmyers changed the title WIP Open ELB to kops-controller port when using it for internal API Open ELB to kops-controller port when using it for internal API Oct 30, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 30, 2020
@seh
Contributor

seh commented Oct 31, 2020

> I built kops including this patch, and it worked as intended: my worker machines were able to contact kops-controller on port 3988 and finish their bootstrap procedure.

I spoke too soon. It seemed that this worked earlier, but then failed again in my current test. What I observed was that opening up the ELB's security group to allow ingress on port 3988 from the "nodes" security group wasn't good enough. Opening ingress on port 3988 from everywhere did work.

I don't understand why this is so. The source machine—one of my worker machines—is a member of the "nodes" security group. It's also a member of another security group, but—not that I expected this to work—when I also tried allowing ingress to the ELB from that other security group, it didn't change the outcome. Only opening up the ELB to ingress from all sources has worked so far.

I can't tell if there's some SNAT going on here that's confusing AWS's ability to tell that the incoming traffic (that is, from a worker machine to the ELB) is coming from a blessed security group.

@seh
Contributor

seh commented Oct 31, 2020

I enabled access logs on the ELB, but the client IP addresses only show the IP addresses of the ELB listeners, which isn't helpful. There are a few clients with an address like 18.188.58.156, which does not look like one from any of my VPCs.

I tried replacing the ELB security group rule allowing ingress from anywhere with one allowing ingress from the same ELB security group itself, just to see if this firewall enforcement matches what the ELB logs show. That didn't work, though: traffic was still blocked from the worker machines.

I'm going to take a break and get some sleep, and see if a reasonable explanation comes to me—as usually occurs the moment I turn off the computer.

@hakman
Member

hakman commented Oct 31, 2020

Your problem seems to be that the ELB is public, and the SGs will not help much with that. Probably 18.188.58.156 is the VPC NAT GW address.
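The NAT-gateway explanation can be illustrated with a small, self-contained Go sketch (the VPC CIDR 10.0.0.0/16 and node address 10.0.42.7 are hypothetical values, not from this cluster): a security-group source rule can only match traffic whose source address stays inside the VPC, and the NAT gateway's Elastic IP does not.

```go
package main

import (
	"fmt"
	"net"
)

// inAnyCIDR reports whether ip falls inside any of the given CIDR blocks.
func inAnyCIDR(ip string, cidrs []string) bool {
	addr := net.ParseIP(ip)
	for _, c := range cidrs {
		_, block, err := net.ParseCIDR(c)
		if err != nil {
			continue
		}
		if block.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical VPC CIDR; the real value depends on the cluster.
	vpcCIDRs := []string{"10.0.0.0/16"}
	// Address observed in the ELB access logs above: the NAT gateway's
	// Elastic IP, which lies outside every VPC CIDR block.
	fmt.Println(inAnyCIDR("18.188.58.156", vpcCIDRs)) // false
	// A hypothetical private node address, which a source-group rule
	// could match when traffic does not traverse the NAT gateway.
	fmt.Println(inAnyCIDR("10.0.42.7", vpcCIDRs)) // true
}
```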

FromPort: fi.Int64(wellknownports.KopsControllerPort),
Protocol: fi.String("tcp"),
SecurityGroup: lbSG,
SourceGroup: nodeGroup.Task,
Contributor

Can we check if the API server load balancer is public, and if so, use a more lax source range here? As @hakman realized, for a public ELB, the traffic comes in from the nodes via a NAT gateway with a source address that's not even within any of the VPC's CIDR blocks. (The NAT gateways each have a private IP address within the VPC, but the corresponding "Elastic IP address" is outside the VPC.)
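The choice being discussed here can be sketched as follows. This is illustrative only; the type and function names are hypothetical and do not reflect kops's actual task API, which builds these rules from `fi` tasks as in the hunk above.

```go
package main

import "fmt"

// IngressSource models one security-group ingress rule's source:
// either a CIDR range or another security group. Sketch only.
type IngressSource struct {
	CIDR        string // set when allowing a CIDR range
	SourceGroup string // set when allowing another security group
}

// kopsControllerSource picks the ingress source for the kops-controller
// port. A public ELB sees node traffic NAT-translated to the gateway's
// Elastic IP, so a source-group rule cannot match it and a lax CIDR rule
// would be needed; an internal ELB can be scoped to the nodes' group.
func kopsControllerSource(lbIsPublic bool, nodeSG string) IngressSource {
	if lbIsPublic {
		return IngressSource{CIDR: "0.0.0.0/0"}
	}
	return IngressSource{SourceGroup: nodeSG}
}

func main() {
	fmt.Println(kopsControllerSource(true, "sg-nodes"))
	fmt.Println(kopsControllerSource(false, "sg-nodes"))
}
```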

Member Author

I'd prefer not to expose the port externally if possible. While the authentication is strong, there are denial of service considerations.

Contributor

Per #10139 (comment), I now think we're fixing the wrong problem here.

Member

If someone asks for a public LB and UseForInternalApi, I think we should allow access from 0.0.0.0/0, same as for the API.
Or maybe not allow UseForInternalApi in this case, as it won't work anyway.
I don't have a strong preference in general here, so feel free to ignore my comment.

Contributor

I want to make sure I understand your second proposal: Could kops reject "useForInternalApi" as invalid when the load balancer is public? I think that's the best choice, if it's feasible.
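The rejection being proposed could look roughly like this. A minimal sketch, assuming a simplified spec shape; the type and field names here are hypothetical, not kops's real API types or its actual validation code.

```go
package main

import (
	"errors"
	"fmt"
)

// LoadBalancerSpec is a simplified stand-in for the API load balancer
// configuration discussed in this thread.
type LoadBalancerSpec struct {
	Type              string // "Public" or "Internal"
	UseForInternalAPI bool
}

// validate rejects useForInternalApi on a public load balancer, since
// node traffic reaching a public ELB arrives via the NAT gateway and
// cannot be scoped to the nodes' security group.
func validate(lb LoadBalancerSpec) error {
	if lb.UseForInternalAPI && lb.Type == "Public" {
		return errors.New("spec.api.loadBalancer.useForInternalApi requires an Internal load balancer")
	}
	return nil
}

func main() {
	fmt.Println(validate(LoadBalancerSpec{Type: "Public", UseForInternalAPI: true}))
	fmt.Println(validate(LoadBalancerSpec{Type: "Internal", UseForInternalAPI: true}))
}
```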

Contributor

I liked the option—clearly, as I tried to enable it—for accessing the Kubernetes API servers, not really thinking about whether the load balancer was public or not. At that point, though, I didn't realize that it would be used for other things like this node bootstrapping.

And did this other use start in kops 1.19? I didn't experience this problem with kops 1.18.2.

@johngmyers johngmyers changed the title Open ELB to kops-controller port when using it for internal API WIP Open ELB to kops-controller port when using it for internal API Oct 31, 2020
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 31, 2020
@johngmyers
Copy link
Member Author

johngmyers commented Oct 31, 2020

Looks like the choices might be to expose the port externally, pay for a second, internal load balancer, or to use a dns-controller domain regardless of the UseForInternalAPI setting.

@johngmyers
Copy link
Member Author

Going with a separate dns-controller-managed domain instead, in #10239

@johngmyers johngmyers closed this Nov 14, 2020
@johngmyers johngmyers deleted the internal-api-elb branch November 15, 2020 03:21
Successfully merging this pull request may close these issues.

Enabling "spec.api.loadBalancer.useForInternalApi" requires access to kops controller port through API load balancer