
Conversation

Weil0ng
Contributor

@Weil0ng Weil0ng commented Nov 3, 2020

This is for compatibility with Istio in kube-proxy free mode.
Currently, even though Istio still receives all traffic within the pod
namespace, the original service VIP is lost during socket LB, causing
traffic to miss all Istio routing chains and therefore bypass all
Istio functionality.

This adds a new option to bypass socket LB in the pod namespace. When
enabled, service resolution for connections from pod namespaces is
handled in bpf_lxc at the veth. For host-namespaced pods, socket LB is
kept as is.

Signed-off-by: Weilong Cui cuiwl@google.com
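
For readers skimming the thread, the mechanism can be summarized with a minimal sketch in the style of bpf/bpf_sock.c. This is illustrative only: SKIP_SK_LB_IN_POD_NS and in_hostns appear in the diff discussed below, while the function name and the -ENXIO early return are assumptions rather than the actual patch.

```c
/* Sketch: when the agent defines SKIP_SK_LB_IN_POD_NS and the connecting
 * socket is not in the host network namespace, the socket-level LB bails
 * out early, so the packet leaves the pod still addressed to the ClusterIP
 * and bpf_lxc performs the service translation at the veth instead.
 */
static __always_inline int sock_xlate_fwd_sketch(struct bpf_sock_addr *ctx,
						 const bool in_hostns)
{
#ifdef SKIP_SK_LB_IN_POD_NS
	if (!in_hostns)
		return -ENXIO;	/* assumed return value: skip socket LB */
#endif
	/* ... regular socket-level LB: look up the service by
	 * ctx->user_ip4/user_ip6 and ctx->user_port, pick a backend and
	 * rewrite the destination in place ...
	 */
	return 0;
}
```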

@Weil0ng Weil0ng requested review from brb, borkmann, joestringer and a team November 3, 2020 21:02
@Weil0ng Weil0ng requested a review from a team as a code owner November 3, 2020 21:02
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 3, 2020
Member

@brb brb left a comment


Thanks!

One comment, and it would also be good if you could extend k8sT/Services.go to check that we don't break anything in kube-proxy-less mode when the flag is enabled.

bpf/bpf_sock.c Outdated
@@ -775,6 +780,11 @@ static __always_inline int __sock6_xlate_fwd(struct bpf_sock_addr *ctx,
bool backend_from_affinity = false;
__u32 backend_id = 0;

#ifdef SKIP_SK_LB_IN_POD_NS
if (!in_hostns)
Member


Please keep in mind that the availability of in_hostns depends on the get_netns_cookie() kernel helper introduced in https://lore.kernel.org/bpf/cover.1585323121.git.daniel@iogearbox.net/T/. So, you might want to check for the helper's availability in the agent and notify the user about its absence in case they have enabled SkipSkLBInPodNS.

Contributor Author


I believe this is checked in ctx_in_hostns(). In the absence of the helper, in_hostns will be set to true and we won't skip socket LB.
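
For context, the fallback being described looks roughly like the following. This is a simplified sketch, not the verbatim Cilium code; the BPF_HAVE_NETNS_COOKIE and HOST_NETNS_COOKIE names are assumptions about how the agent probes for the helper and passes the host cookie in.

```c
/* Sketch: without get_netns_cookie() there is no way to tell pod and host
 * namespaces apart, so the check conservatively reports "host namespace"
 * and the socket LB is never skipped.
 */
static __always_inline bool ctx_in_hostns(void *ctx, __u64 *cookie)
{
#ifdef BPF_HAVE_NETNS_COOKIE
	__u64 own_cookie = get_netns_cookie(ctx);

	if (cookie)
		*cookie = own_cookie;
	return own_cookie == HOST_NETNS_COOKIE;
#else
	if (cookie)
		*cookie = 0;
	return true;	/* helper unavailable: assume host netns */
#endif
}
```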

Member


Is the workaround ineffective on those particular kernels then? Or does this mean that host services will break when this flag is configured on those kernels?

If either of the above cases holds, it seems like it would be useful to notify the user when they enable this flag and do not have the corresponding BPF helper.

Member


Is the workaround ineffective on those particular kernels then

This. Therefore, I'm suggesting we warn users via a log msg about Istio rules being bypassed. If we make the flag less generic, i.e. Istio-specific, then instead of warning we could panic.

Contributor Author


Added logging in daemon/cmd/kube_proxy_replacement.go.

Member


I think this will result in odd semantics: for the BPF kube-proxy replacement we'll end up skipping everything for Pods, but not for the host ns. The latter will be fine, but for Pods, given we compile the xlation into bpf_lxc, ClusterIP services will be handled there, and NodePort will pass through and then be translated on the remote node via the N-S LB. However, externalIPs or hostPort translation for the local node, for example, will not happen. This means this feature will really only work when you run the regular kube-proxy on the side to cover the remaining corner cases. Is that intentional?

Member


Could you also update the kube-proxy-free guide with the concrete semantics? Probably a good fit under: https://docs.cilium.io/en/v1.9/gettingstarted/kubeproxy-free/#kube-proxy-hybrid-modes

Contributor Author


Hmm, I'm reviewing what has been done and what still needs to be done for this PR, and I realized I completely overlooked this comment a while ago... so basically, are you saying all "host-reachable" services from the pod ns won't work if this option is turned on? @borkmann

@aanm aanm added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Nov 4, 2020
@Weil0ng
Contributor Author

Weil0ng commented Nov 4, 2020

Thanks!

One comment, and it would also be good if you could extend k8sT/Services.go to check that we don't break anything in kube-proxy-less mode when the flag is enabled.

I agree we should have a test for this, but I'm a bit confused as to what should be included in the tests. I assume basic ClusterIP connectivity should be good?

@aanm aanm removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Nov 10, 2020
@Weil0ng Weil0ng requested review from a team as code owners November 13, 2020 00:07
@Weil0ng Weil0ng requested review from a team and kaworu November 13, 2020 00:07
@Weil0ng
Contributor Author

Weil0ng commented Nov 13, 2020

Thanks!

One comment, and it would also be good if you could extend k8sT/Services.go to check that we don't break anything in kube-proxy-less mode when the flag is enabled.

Added two basic connectivity tests with skipSocketLB set to true, PTAL :)

Member

@christarazi christarazi left a comment


A few minor items to address. I didn't review very deeply and will defer to others on the approach implemented here. I mostly looked at the changes at a surface level.

Also, just a meandering thought, but the description of the flag strikes me as very in-the-weeds, and I'm not sure that users will read it and understand what it's for. However, I do realize that the flag itself controls "in-the-weeds" behavior inherently...

@aditighag
Member

aditighag commented Nov 13, 2020

Currently, even though Istio still receives all traffic within the pod
namespace

Is this redirection achieved via iptables rules [1]?

[1] https://github.com/istio/istio/wiki/Proxy-redirection

@Weil0ng
Contributor Author

Weil0ng commented Nov 13, 2020

Currently, even though Istio still receives all traffic within the pod
namespace

Is this redirection achieved via iptables rules [1]?

[1] https://github.com/istio/istio/wiki/Proxy-redirection

yup

@Weil0ng Weil0ng force-pushed the netns branch 2 times, most recently from 46ad445 to 895f346 Compare November 16, 2020 23:41
status.ExpectSuccess("cannot curl to service IP from host")

// Test connectivity from pod ns
testCurlFromPods("id=app2", httpSVCURL, 10, 0)
Member


Should we check for the existence of the cluster IP in the cilium monitor output here when the config option is enabled and service translation doesn't happen via the BPF socket prog?

Contributor Author


Good point, done.

Contributor Author


Changed how the test works. It turns out cilium monitor won't capture the cluster IP in either case, so the test now checks the CT table: we should see the service VIP in CT only when tc service resolution takes effect.

Member


Discussed offline - @Weil0ng is going to add a check for the cluster IP in the monitor output in order to validate that the flag works.

Contributor Author


Done.

@kaworu kaworu requested a review from a team December 8, 2020 11:08
@kaworu kaworu removed their assignment Dec 8, 2020
@Weil0ng Weil0ng force-pushed the netns branch 2 times, most recently from 65e4722 to 8ae865c Compare December 14, 2020 20:34
@Weil0ng
Contributor Author

Weil0ng commented Dec 14, 2020

retest-net-next


@stale stale bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jan 17, 2021
@borkmann borkmann added the pinned These issues are not marked stale by our issue bot. label Jan 18, 2021
@stale stale bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jan 18, 2021
@joestringer
Member

What are the next steps on this PR? Are you looking for additional feedback @Weil0ng? Or are you currently addressing the outstanding feedback?

@Weil0ng
Contributor Author

Weil0ng commented Mar 3, 2021

What are the next steps on this PR? Are you looking for additional feedback @Weil0ng? Or are you currently addressing the outstanding feedback?

I think the only thing left is the complexity issue, for which I don't currently have a good solution. Last time we synced with @pchaigno, the general direction was to break bpf_lxc into more tail calls, but I think that should probably be tackled in another PR?

@pchaigno
Member

pchaigno commented Mar 8, 2021

Let's maybe switch this PR to draft until we have fixed the complexity issue? I don't expect a lot of conflicts even if you rebase after a couple of weeks, so it should be fine. I don't think anyone has cycles right now to address the complexity issue, but it should get better after the feature freeze.

@brb
Member

brb commented Mar 8, 2021

Ah right, yeah, the only way we can do that today is with the netns cookie, which is available in Linux 5.7 or later. When Kubernetes moves to cgroupv2 (KEP), it should be possible to perform this on older kernels.

Considering that recent k8s (AFAIK 1.20) has started supporting cgroupv2, maybe it's worth revisiting the other approach to solving the problem, i.e. identifying Istio pods?

@Weil0ng Weil0ng marked this pull request as draft March 10, 2021 16:02
@Weil0ng
Contributor Author

Weil0ng commented Apr 20, 2021

Sorry for not tracking this lately. Had a brief discussion during today's sync with @borkmann: the idea is that if we can support SO_ORIGINAL_DST via a BPF sockopt program with cgroupv2, then maybe that's enough for Istio. The redirection part is relatively straightforward; LRP can probably be used there. I'll investigate the sockopt capabilities and comment back.
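
To make the SO_ORIGINAL_DST idea concrete, here is a rough sketch. Nothing below is part of this PR: orig_dst_map, the tuple key, and the assumption that some connect-time hook records the original ClusterIP before the socket LB rewrites the destination are all made up for illustration. Only the cgroup/getsockopt hook mechanics follow the documented BPF sockopt interface, and details such as byte ordering and whether the hook may turn a kernel error into success vary by kernel version.

```c
#include <linux/bpf.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>

#define SOL_IP		0
#define SO_ORIGINAL_DST	80	/* from linux/netfilter_ipv4.h */

struct tuple_key {
	__u32 saddr;
	__u32 daddr;
	__u16 sport;
	__u16 dport;
};

/* Hypothetical map, filled at connect() time with the pre-translation
 * service address (byte-order handling omitted for brevity). */
struct {
	__uint(type, BPF_MAP_TYPE_LRU_HASH);
	__uint(max_entries, 65536);
	__type(key, struct tuple_key);
	__type(value, struct sockaddr_in);
} orig_dst_map SEC(".maps");

SEC("cgroup/getsockopt")
int expose_original_dst(struct bpf_sockopt *ctx)
{
	struct sockaddr_in *sa = ctx->optval;
	struct sockaddr_in *orig;
	struct tuple_key key = {};

	if (ctx->level != SOL_IP || ctx->optname != SO_ORIGINAL_DST)
		return 1;		/* not ours, let the kernel handle it */
	if (!ctx->sk)
		return 1;
	if ((void *)(sa + 1) > ctx->optval_end)
		return 1;		/* buffer too small, don't touch it */

	key.saddr = ctx->sk->src_ip4;
	key.daddr = ctx->sk->dst_ip4;
	key.sport = ctx->sk->src_port;
	key.dport = ctx->sk->dst_port;

	orig = bpf_map_lookup_elem(&orig_dst_map, &key);
	if (!orig)
		return 1;		/* no record, keep kernel behaviour */

	*sa = *orig;			/* hand back the original VIP:port */
	ctx->optlen = sizeof(*sa);
	ctx->retval = 0;		/* report getsockopt() success */
	return 1;
}

char _license[] SEC("license") = "GPL";
```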

@alex-berger
Contributor

Just want to chime in as this also affects Linkerd (not just Istio).

--namespace kube-system \\
--set tunnel=disabled \\
--set autoDirectNodeRoutes=true \\
--set kubeProxyReplacement=strick \\
Contributor


Typo sneaked in here, shouldn't it be strict?

Contributor Author


Done.

*******************************************

Cilium has built-in support for bypassing the socket-level LoadBalancer and falling back
to the tc LoadBalancer when a custom redirection/operation relies on the original ClusterIP


What is a tc LoadBalancer? Maybe clarify here if easily possible.

Contributor Author


Added "at the veth interface" to help clarify, PTAL :)

This is for compatibility with Istio in kube-proxy free mode.
Currently, even though Istio still receives all traffic within the pod
namespace, the original service VIP is lost during socket LB, causing
traffic to miss all Istio routing chains and therefore bypass all
Istio functionality.

This adds a new option to bypass socket LB in the pod namespace. When
enabled, service resolution for connections from pod namespaces is
handled in bpf_lxc at the veth. For host-namespaced pods, socket LB is
kept as is.

Signed-off-by: Weilong Cui <cuiwl@google.com>
@alex-berger
Contributor

Any progress on this? Is there anything we can help with to get this released soonish (in 1.10.* or at least 1.11.*)?

@brb
Member

brb commented Aug 24, 2021

Closing in favor of #17154. /cc @Weil0ng

@brb brb closed this Aug 24, 2021