Adds a new option to skip socket lb when in pod ns #13882
Conversation
Thanks!
One comment, and it would also be good if you could extend k8sT/Services.go to check that we don't break anything in kube-proxy-less mode when the flag is enabled.
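For illustration, here is a rough standalone sketch of the kind of check such a test could add. It does not use the cilium test framework; the namespace, the id=app2 label selector, and the ClusterIP URL are placeholders. The idea: with the flag enabled, a pod should still be able to reach a ClusterIP service.

package main

import (
	"fmt"
	"os/exec"
)

// curlFromPod shells out to kubectl and curls the given URL from inside the
// first pod that matches the label selector.
func curlFromPod(namespace, labelSelector, url string) error {
	pod, err := exec.Command("kubectl", "-n", namespace, "get", "pods",
		"-l", labelSelector, "-o", "jsonpath={.items[0].metadata.name}").Output()
	if err != nil {
		return fmt.Errorf("listing pods: %w", err)
	}
	out, err := exec.Command("kubectl", "-n", namespace, "exec", string(pod),
		"--", "curl", "-sS", "--fail", "--connect-timeout", "5", url).CombinedOutput()
	if err != nil {
		return fmt.Errorf("curl from pod failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// The label and URL are placeholders for the echo deployment and the
	// ClusterIP-backed service used by the suite.
	if err := curlFromPod("default", "id=app2", "http://10.96.0.123:80"); err != nil {
		fmt.Println("connectivity check failed:", err)
		return
	}
	fmt.Println("ClusterIP reachable from the pod namespace with the flag enabled")
}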
bpf/bpf_sock.c (outdated)
@@ -775,6 +780,11 @@ static __always_inline int __sock6_xlate_fwd(struct bpf_sock_addr *ctx,
	bool backend_from_affinity = false;
	__u32 backend_id = 0;

#ifdef SKIP_SK_LB_IN_POD_NS
	if (!in_hostns)
Please keep in mind that the availability of in_hostns depends on the get_netns_cookie() kernel helper introduced in https://lore.kernel.org/bpf/cover.1585323121.git.daniel@iogearbox.net/T/. So you might want to check for the helper's availability in the agent and notify the user about its absence in case they have enabled SkipSkLBInPodNS.
I believe this is checked in ctx_in_hostns(). In the absence of such helpers, in_hostns will be set to true and we won't skip socket lb.
Is the workaround ineffective on those particular kernels then? Or does this mean that host services will break when this flag is configured on those kernels?
If either of the above cases holds, it seems like it would be useful to notify the user when they enable this flag and do not have the corresponding BPF helper.
"Is the workaround ineffective on those particular kernels then?"
This. Therefore, I'm suggesting we warn users via a log msg about Istio rules being bypassed. If we make the flag less generic, i.e. Istio-specific, then instead of warning we could panic.
Added logging in daemon/cmd/kube_proxy_replacement.go.
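For illustration only, a minimal sketch of such a warning, assuming logrus for logging; the function and parameter names are hypothetical and not the code added in this PR.

package main

import (
	log "github.com/sirupsen/logrus"
)

// warnIfBypassIneffective emits a warning when the bypass flag is enabled but
// the kernel cannot support it. The probe result is passed in as a plain bool
// here; the real agent derives it from its BPF feature probes.
func warnIfBypassIneffective(skipSocketLBInPodNS, haveNetNSCookieHelper bool) {
	if skipSocketLBInPodNS && !haveNetNSCookieHelper {
		log.Warning("Socket LB bypass for pod namespaces is enabled, but the kernel " +
			"lacks the get_netns_cookie() BPF helper, so the bypass has no effect and " +
			"Istio redirection rules will still be bypassed")
	}
}

func main() {
	// Example: flag enabled on a kernel without the helper.
	warnIfBypassIneffective(true, false)
}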
I think this will result in odd semantics: for the BPF kube-proxy replacement we'll end up skipping everything for Pods, but not for the host ns. The latter will be fine, but for Pods, given we compile the xlation into bpf_lxc, we'll have ClusterIP services, and NodePort we'll pass through; the BPF code would then translate it on the remote node via the N-S LB. However, for example, externalIPs or hostPort translation for the local node will not happen. This means this feature will really only work when you run the regular kube-proxy on the side to cover the remaining corner cases. Is that intentional?
Could you also update the kube-proxy-free guide with the concrete semantics? Probably a good fit under: https://docs.cilium.io/en/v1.9/gettingstarted/kubeproxy-free/#kube-proxy-hybrid-modes
Hmm, I'm reviewing what we did / what still needs to be done for this PR now and realized I completely overlooked this comment somehow a while ago... so basically, are you saying all "host-reachable" services for pod ns won't work if this option is turned on? @borkmann
I agree we should have tests for this, but I'm a bit confused as to what should be included in them. I assume the basic cluster IP connectivity should be good?
Added two basic connectivity tests.
A few minor items to address. I didn't review very deeply and will defer to others on the approach implemented here. I mostly looked at the changes at a surface level.
Also, just a meandering thought, but the description of the flag strikes me as very in-the-weeds, and I'm not sure that users will read it and understand what it's for. However, I do realize that the flag itself controls "in-the-weeds" behavior inherently...
Is this redirection achieved via iptables rules [1]?
Yup.
status.ExpectSuccess("cannot curl to service IP from host")

// Test connectivity from pod ns
testCurlFromPods("id=app2", httpSVCURL, 10, 0)
Should we check for the existence of the cluster IP in the cilium monitor output here when we have the config option enabled and service translation doesn't happen via the BPF socket prog?
Good point, done.
Changed how the test works. It turns out cilium monitor won't capture the cluster IP in either case, so now it checks the ct table; we should see the service VIP in ct only when tc service resolution takes effect.
Discussed offline - @Weil0ng is going to add a check for the cluster IP in the monitor output in order to validate that the flag works.
Done.
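As a rough illustration of that validation (this is not the actual test code; the agent pod name and ClusterIP are placeholders): with the flag enabled, the ClusterIP should show up in the cilium monitor trace because translation now happens at the veth/tc layer rather than at connect() time.

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// clusterIPSeenByMonitor samples `cilium monitor` output from the agent pod and
// reports whether the given ClusterIP shows up in it. In the real test this
// would run while traffic is generated from the app pod.
func clusterIPSeenByMonitor(ciliumPod, clusterIP string) bool {
	// `timeout` cuts the sample short and exits non-zero, so the error is
	// deliberately ignored and only the captured output is inspected.
	out, _ := exec.Command("kubectl", "-n", "kube-system", "exec", ciliumPod,
		"--", "timeout", "5", "cilium", "monitor").CombinedOutput()
	return strings.Contains(string(out), clusterIP)
}

func main() {
	// Pod name and ClusterIP are placeholders.
	fmt.Println("ClusterIP seen in monitor output:",
		clusterIPSeenByMonitor("cilium-abc12", "10.96.0.123"))
}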
retest-net-next
What are the next steps on this PR? Are you looking for additional feedback @Weil0ng? Or are you currently addressing the outstanding feedback?
I think the only thing left is the complexity issue, for which I don't currently have a good solution. Last time we synced with @pchaigno, the general direction is to break
Let's maybe switch this PR to draft until we have fixed the complexity issue? I don't expect a lot of conflicts even if you rebase after a couple of weeks, so it should be fine. I don't think anyone has cycles to address the complexity issue now, but it should get better after the feature freeze.
Considering that recent k8s (AFAIK 1.20) started supporting cgroupv2, maybe it's worth revisiting the other approach to solving the problem, i.e. identifying Istio pods?
Sorry for not tracking this lately. Had a brief discussion during today's sync with @borkmann; the idea is that if we can set
Just want to chime in as this also affects Linkerd (not just Istio).
--namespace kube-system \
--set tunnel=disabled \
--set autoDirectNodeRoutes=true \
--set kubeProxyReplacement=strick \
A typo sneaked in here, shouldn't it be strict?
Done.
*******************************************

Cilium has built-in support for bypassing the socket-level LoadBalancer and falling back
to the tc LoadBalancer when a custom redirection/operation relies on the original ClusterIP
What is a tc LoadBalancer? Maybe clarify here if easily possible.
Added "at the veth interface" to help clarify, PTAL :)
Any progress on this? Is there anything we can help with to release this soonish (in
This is for compatibility with Istio in kube-proxy free mode.
Currently, even though Istio still gets all traffic within the pod
namespace, the original service VIP is lost during socket lb, causing
it to miss all Istio routing chains and therefore bypassing all Istio
functionality.

This adds a new option to bypass socket lb in the pod namespace. When
enabled, service resolution for connections from pod namespaces will be
handled in bpf_lxc at the veth. For host-namespaced pods, socket lb is
kept as is.

Signed-off-by: Weilong Cui cuiwl@google.com