Remove unused sections for bpf_lxc from nodeport header #21505
Conversation
```diff
@@ -571,7 +571,7 @@ int tail_nodeport_ipv6_dsr(struct __ctx_buff *ctx)
 }
 #endif /* ENABLE_DSR */
 
-__section_tail(CILIUM_MAP_CALLS, CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
+declare_tailcall_if(__not(is_defined(IS_BPF_LXC)), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
```
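For context, `declare_tailcall_if` places the function in a tail-call ELF section only when its condition holds; otherwise the function is dropped from the object file and the kernel never has to verify it. A minimal sketch of how such a macro can be built (simplified, not Cilium's exact `lib/tailcall.h`), assuming the condition has already been reduced to a literal `0` or `1` by helpers like `is_defined()` and `__not()`:

```c
/* Condition false: the function becomes an unused static symbol, so the
 * compiler drops it and the verifier never sees it. */
#define __declare_tailcall_if_0(NAME)	static __maybe_unused
/* Condition true: place the function in the tail-call section as usual. */
#define __declare_tailcall_if_1(NAME)	__section_tail(CILIUM_MAP_CALLS, NAME)
/* Extra expansion layer so COND is macro-expanded to 0/1 before pasting. */
#define __declare_tailcall_if(COND, NAME) __declare_tailcall_if_##COND(NAME)
#define declare_tailcall_if(COND, NAME)	__declare_tailcall_if(COND, NAME)
```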
I'm not sure if this would be better, but please consider using a positive condition, and if that's not possible, then maybe define the positive condition separately, e.g.

```c
declare_tailcall_if(is_defined(X_ENABLED), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
declare_tailcall_if(is_defined(IS_BPF_ABC), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
```

or

```c
#define LOAD_NODEPORT_NAT_INGRESS (conditions)
declare_tailcall_if(is_defined(LOAD_NODEPORT_NAT_INGRESS), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
```
Hi, the change looks good to me. Please provide a release note and rebase, as the PR conflicts with the latest master.
https://docs.cilium.io/en/stable/contributing/development/contributing_guide/
Hi, thanks for the approval. I added a release note and resolved the conflicts. I believe this change should have a
/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed. If it is a flake and a GitHub issue doesn't already exist to track it, comment
One of the Kubernetes SLOs is the Pod Startup Latency SLO. The current limit is a 99th percentile <= 5s. With the use of Cilium, Kubernetes is very close to this limit, sometimes even exceeding it. One of the major bottlenecks is the eBPF program verification performed by the kernel. These changes remove sections that are unused by bpf_lxc, which reduces Pod Startup Latency by ~500ms. Signed-off-by: Alex Katsman <alexkats@google.com>
/test
Hi, thanks for your help towards merging this PR. I have a question: could this change be a candidate for a backport to 1.12?
```diff
@@ -11,6 +11,8 @@
 
 #include <linux/icmpv6.h>
 
+#define IS_BPF_LXC 1
+
 /* Controls the inclusion of the CILIUM_CALL_SRV6 section in the object file.
  */
 #define SKIP_SRV6_HANDLING
```
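The effect of this define is that shared headers included afterwards can skip emitting sections bpf_lxc never tail-calls into. A hedged sketch of the pattern (the tail-call constant, helper call, and function body here are illustrative placeholders, not the exact tree):

```c
/* bpf_lxc.c (sketch): the define must come before the shared includes so
 * that the guards in lib/nodeport.h can see it. */
#define IS_BPF_LXC 1
#include "lib/nodeport.h"

/* lib/nodeport.h (sketch): the egress NAT entry point is only placed in a
 * tail-call section for programs other than bpf_lxc; for bpf_lxc it is
 * dropped from the object file and never verified by the kernel. */
declare_tailcall_if(__not(is_defined(IS_BPF_LXC)),
		    CILIUM_CALL_IPV4_NODEPORT_NAT_EGRESS)
int tail_nodeport_nat_egress_ipv4(struct __ctx_buff *ctx)
{
	return nodeport_nat_egress_ipv4(ctx); /* hypothetical helper */
}
```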
Why not use `#ifndef SKIP_NODEPORT_NAT` instead of `declare_tailcall_if`? As you can see above, we already use the first convention today.
Hi, I saw both conventions being used, but since I only wanted to conditionally declare the tail calls, I decided it'd be more natural to use `declare_tailcall_if` instead of putting the whole code under a macro.
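To illustrate the difference between the two conventions (the function name, condition, and body below are illustrative only):

```c
/* Convention 1: wrap the entire definition, body and all, in a guard. */
#ifndef SKIP_NODEPORT_NAT
__section_tail(CILIUM_MAP_CALLS, CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
int tail_nodeport_nat_ingress_ipv6(struct __ctx_buff *ctx)
{
	/* ... NAT handling ... */
	return CTX_ACT_OK;
}
#endif /* SKIP_NODEPORT_NAT */

/* Convention 2 (this PR): only the section placement is conditional. The
 * function always compiles, but when the condition is false it is not
 * placed in a tail-call section and hence never loaded or verified. */
declare_tailcall_if(__not(is_defined(IS_BPF_LXC)),
		    CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
int tail_nodeport_nat_ingress_ipv6(struct __ctx_buff *ctx)
{
	/* ... NAT handling ... */
	return CTX_ACT_OK;
}
```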
This kind of change would typically not be a candidate for backporting based on the standard backporting criteria, as it is not a fix: https://docs.cilium.io/en/stable/contributing/release/backports/#backport-criteria

That said, if we think the risk is low but the benefit is high here, then it may be worth backporting to 1.12.x. I know that there are occasionally user-reported issues around the ability of Cilium to create pods at a certain rate, and this could have a meaningful impact on those users. This seems at least reasonable to raise as a discussion point during the next community meeting; I'll put it on the agenda.
Hi @alexkats, could you share on which kernel(s) you made the measurements and with what Cilium configuration(s)?

In which conditions was it exceeding the limit? Your numbers show that it's always <5s, but I'm guessing some of the runs exceeded 5s? Is that still happening, just less frequently, with this patch?

I'm also a bit confused: did you test this with Cilium OSS or with GKE's patched Cilium version?
Hi @pchaigno, thanks for the discussion. I used GKE's patched Cilium version based on Cilium OSS master (that's why I stated that I used the master branch). The kernel version used was 5.10.109.

Actually, we try to aim for 3.5 seconds. We want to minimize the pod startup latency regression when changing the kube-proxy based dataplane to Cilium on GKE. There is a k8s SLO (99th percentile per cluster-day <= 5s), so we have some room to spend, but we want to use as little of it as possible. I performed tests with different numbers of nodes in the cluster, specifically 100, 500, and 5k. With the change we move further away from the limit for 100 and 500 nodes, and get closer to 5 seconds for 5k nodes.
Hi @alexkats, it sounds like we still have a bug (SLO breakage) at scale. Could you open an issue so we can properly track that? We will also be able to refer to that issue as motivation when backporting the present pull request.

Did you collect Cilium metrics when running those benchmarks? It would be useful to understand where the additional pod creation latency is coming from (e.g., slow kube-apiserver at scale).
Background
One of the Kubernetes SLOs is the Pod Startup Latency SLO. The current limit is a 99th percentile <= 5s. With the use of Cilium, Kubernetes is very close to this limit, sometimes even exceeding it. One of the major bottlenecks is the eBPF program verification performed by the kernel.
Change description
I went through the code to find the kernel verification times for different sections; the most expensive ones were `tail_nodeport_nat_egress_ipv4` and `tail_nodeport_nat_egress_ipv6`. These sections are not used by bpf_lxc at all, yet they were still added to the object file and verified by the kernel. This change removes the sections that are unused by bpf_lxc from the nodeport header. After this change, the total verification time for all bpf_lxc programs dropped almost by half (from ~500ms to ~270ms), and combined with the rest of the program-loading routine, these changes contributed to an even larger drop in overall latency.
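As an illustration of how per-section verification cost can be measured, here is a hypothetical libbpf sketch (not the author's actual tooling; in practice Cilium programs also need their maps and configuration set up before they will load, so treat this purely as a sketch of the timing approach):

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <bpf/libbpf.h>

/* Approximate the verification cost of one program in an object file by
 * enabling only that program and timing the load. */
static long load_one_usec(const char *path, const char *name)
{
	struct bpf_object *obj = bpf_object__open_file(path, NULL);
	struct bpf_program *p;
	struct timespec t0, t1;
	long usec = -1;

	if (!obj)
		return -1;

	/* Only the program under test is loaded (and hence verified). */
	bpf_object__for_each_program(p, obj)
		bpf_program__set_autoload(p,
			!strcmp(bpf_program__name(p), name));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (!bpf_object__load(obj)) {
		clock_gettime(CLOCK_MONOTONIC, &t1);
		usec = (t1.tv_sec - t0.tv_sec) * 1000000L +
		       (t1.tv_nsec - t0.tv_nsec) / 1000L;
	}
	bpf_object__close(obj);
	return usec;
}

int main(void)
{
	/* The two most expensive sections named above. */
	const char *names[] = { "tail_nodeport_nat_egress_ipv4",
				"tail_nodeport_nat_egress_ipv6" };

	for (int i = 0; i < 2; i++)
		printf("%s: %ld us\n", names[i],
		       load_one_usec("bpf_lxc.o", names[i]));
	return 0;
}
```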
Test scenario
All tests were run on cluster version 1.24.4-gke.300. The Cilium master branch was used as the baseline for all tests. To test the different configurations, a GKE sandbox with custom Dataplane V2 images was used. The cluster was created with 100 nodes and DPv2 enabled. For the Pod Startup Latency measurements, a modified configuration of the GKE Scalability 100-node performance test was used.
At the beginning of each run, one pod was created on each node with the same image as in the Pod Startup Latency measurement, so that image pull time does not affect the resulting latency. Then 500 deployments with 5 pods each were created, and we waited until they were all running to measure the pod startup latency.
Results
Results were aggregated from 20 separate test runs.
Summary
This optimization reduces the total latency by ~500ms and avoids loading unnecessary sections into the kernel.
Signed-off-by: Alex Katsman <alexkats@google.com>