Adds a new option to skip socket lb when in pod ns #13882
Conversation
Thanks!
One comment, and it would also be good if you could extend k8sT/Services.go to check that we don't break anything in kube-proxy-less mode when the flag is enabled.
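For illustration, here is a rough standalone sketch of the kind of check such a test could add. It does not use the cilium test framework; the namespace, the id=app2 label selector, and the ClusterIP URL are placeholders. The idea: with the flag enabled, a pod should still be able to reach a ClusterIP service.

package main

import (
	"fmt"
	"os/exec"
)

// curlFromPod shells out to kubectl and curls the given URL from inside the
// first pod that matches the label selector.
func curlFromPod(namespace, labelSelector, url string) error {
	pod, err := exec.Command("kubectl", "-n", namespace, "get", "pods",
		"-l", labelSelector, "-o", "jsonpath={.items[0].metadata.name}").Output()
	if err != nil {
		return fmt.Errorf("listing pods: %w", err)
	}
	out, err := exec.Command("kubectl", "-n", namespace, "exec", string(pod),
		"--", "curl", "-sS", "--fail", "--connect-timeout", "5", url).CombinedOutput()
	if err != nil {
		return fmt.Errorf("curl from pod failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// The label and URL are placeholders for the echo deployment and the
	// ClusterIP-backed service used by the suite.
	if err := curlFromPod("default", "id=app2", "http://10.96.0.123:80"); err != nil {
		fmt.Println("connectivity check failed:", err)
		return
	}
	fmt.Println("ClusterIP reachable from the pod namespace with the flag enabled")
}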
bpf/bpf_sock.c (outdated)
@@ -775,6 +780,11 @@ static __always_inline int __sock6_xlate_fwd(struct bpf_sock_addr *ctx,
	bool backend_from_affinity = false;
	__u32 backend_id = 0;

#ifdef SKIP_SK_LB_IN_POD_NS
	if (!in_hostns)
Please keep in mind that the availability of in_hostns depends on the get_netns_cookie() kernel helper introduced in https://lore.kernel.org/bpf/cover.1585323121.git.daniel@iogearbox.net/T/. So you might want to check for the helper's availability in the agent and notify the user about its absence in case they have enabled SkipSkLBInPodNS.
I believe this is checked in ctx_in_hostns(). In the absence of such helpers, in_hostns will be set to true and we won't skip socket lb.
Is the workaround ineffective on those particular kernels then? Or does this mean that host services will break when this flag is configured on those kernels?
If either of the above cases holds, it seems like it would be useful to notify the user when they enable this flag and do not have the corresponding BPF helper.
"Is the workaround ineffective on those particular kernels then?"
This. Therefore, I'm suggesting we warn users via a log msg about Istio rules being bypassed. If we make the flag less generic, i.e. Istio-specific, then instead of warning we could panic.
Added logging in daemon/cmd/kube_proxy_replacement.go.
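For illustration only, a minimal sketch of such a warning, assuming logrus for logging; the function and parameter names are hypothetical and not the code added in this PR.

package main

import (
	log "github.com/sirupsen/logrus"
)

// warnIfBypassIneffective emits a warning when the bypass flag is enabled but
// the kernel cannot support it. The probe result is passed in as a plain bool
// here; the real agent derives it from its BPF feature probes.
func warnIfBypassIneffective(skipSocketLBInPodNS, haveNetNSCookieHelper bool) {
	if skipSocketLBInPodNS && !haveNetNSCookieHelper {
		log.Warning("Socket LB bypass for pod namespaces is enabled, but the kernel " +
			"lacks the get_netns_cookie() BPF helper, so the bypass has no effect and " +
			"Istio redirection rules will still be bypassed")
	}
}

func main() {
	// Example: flag enabled on a kernel without the helper.
	warnIfBypassIneffective(true, false)
}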
I think this will result in odd semantics: for the BPF kube-proxy replacement we'll end up skipping everything for Pods, but not for the host ns. The latter will be fine, but for Pods, given we compile the xlation into bpf_lxc, we'll have ClusterIP services, and NodePort we'll pass through; the BPF code would then translate it on the remote node via the N-S LB. However, for example, externalIPs or hostPort translation for the local node will not happen. This means this feature will really only work when you run the regular kube-proxy on the side to cover the remaining corner cases. Is that intentional?
Could you also update the kube-proxy-free guide with the concrete semantics? Probably a good fit under: https://docs.cilium.io/en/v1.9/gettingstarted/kubeproxy-free/#kube-proxy-hybrid-modes
Hmm, I'm reviewing what we did / what still needs to be done for this PR now and realized I completely overlooked this comment somehow a while ago... so basically, are you saying all "host-reachable" services for pod ns won't work if this option is turned on? @borkmann
I agree we should have tests for this, but I'm a bit confused as to what should be included in them. I assume the basic cluster IP connectivity should be good?
Added two basic connectivity tests.
A few minor items to address. I didn't review very deeply and will defer to others on the approach implemented here. I mostly looked at the changes at a surface level.
Also, just a meandering thought, but the description of the flag strikes me as very in-the-weeds, and I'm not sure that users will read it and understand what it's for. However, I do realize that the flag itself controls "in-the-weeds" behavior inherently...
Is this redirection achieved via iptables rules [1]?
Yup.
status.ExpectSuccess("cannot curl to service IP from host")

// Test connectivity from pod ns
testCurlFromPods("id=app2", httpSVCURL, 10, 0)
Should we check for the existence of the cluster IP in the cilium monitor output here when we have the config option enabled and service translation doesn't happen via the BPF socket prog?
Good point, done.
Changed how the test works. It turns out cilium monitor won't capture the cluster IP in either case, so now it checks the ct table; we should see the service VIP in ct only when tc service resolution takes effect.
Discussed offline - @Weil0ng is going to add a check for the cluster IP in the monitor output in order to validate that the flag works.
Done.
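As a rough illustration of that validation (this is not the actual test code; the agent pod name and ClusterIP are placeholders): with the flag enabled, the ClusterIP should show up in the cilium monitor trace because translation now happens at the veth/tc layer rather than at connect() time.

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// clusterIPSeenByMonitor samples `cilium monitor` output from the agent pod and
// reports whether the given ClusterIP shows up in it. In the real test this
// would run while traffic is generated from the app pod.
func clusterIPSeenByMonitor(ciliumPod, clusterIP string) bool {
	// `timeout` cuts the sample short and exits non-zero, so the error is
	// deliberately ignored and only the captured output is inspected.
	out, _ := exec.Command("kubectl", "-n", "kube-system", "exec", ciliumPod,
		"--", "timeout", "5", "cilium", "monitor").CombinedOutput()
	return strings.Contains(string(out), clusterIP)
}

func main() {
	// Pod name and ClusterIP are placeholders.
	fmt.Println("ClusterIP seen in monitor output:",
		clusterIPSeenByMonitor("cilium-abc12", "10.96.0.123"))
}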
retest-net-next
What are the next steps on this PR? Are you looking for additional feedback @Weil0ng? Or are you currently addressing the outstanding feedback?
I think the only thing left is the complexity issue, for which I don't currently have a good solution. Last time we synced with @pchaigno, the general direction is to break
Let's maybe switch this PR to draft until we have fixed the complexity issue? I don't expect a lot of conflicts even if you rebase after a couple of weeks, so it should be fine. I don't think anyone has cycles to address the complexity issue now, but it should get better after the feature freeze.
Considering that recent k8s (AFAIK 1.20) started supporting cgroupv2, maybe it's worth revisiting the other approach to solving the problem, i.e. identifying Istio pods?
Sorry for not tracking this lately. Had a brief discussion during today's sync with @borkmann; the idea is that if we can set
Just want to chime in as this also affects Linkerd (not just Istio).
--namespace kube-system \
--set tunnel=disabled \
--set autoDirectNodeRoutes=true \
--set kubeProxyReplacement=strick \
A typo sneaked in here, shouldn't it be strict?
Done.
*******************************************

Cilium has built-in support for bypassing the socket-level LoadBalancer and falling back
to the tc LoadBalancer when a custom redirection/operation relies on the original ClusterIP
What is a tc LoadBalancer? Maybe clarify here if easily possible.
Added "at the veth interface" to help clarify, PTAL :)
Any progress on this? Is there anything we can help with to release this soonish (in
This is for compatibility with Istio in kube-proxy free mode.
Currently, even though Istio still gets all traffic within the pod
namespace, the original service VIP is lost during socket lb, causing
it to miss all Istio routing chains and therefore bypassing all Istio
functionality.

This adds a new option to bypass socket lb in the pod namespace. When
enabled, service resolution for connections from pod namespaces will be
handled in bpf_lxc at the veth. For host-namespaced pods, socket lb is
kept as is.

Signed-off-by: Weilong Cui cuiwl@google.com