
Conversation


@alexkats alexkats commented Sep 29, 2022

Background
One of the Kubernetes SLOs is the Pod Startup Latency SLO. The current limit is a 99th percentile <= 5s. With Cilium in use, Kubernetes gets very close to this limit and sometimes even exceeds it. One of the major bottlenecks is eBPF program verification performed by the kernel.

Change description
I went through the code to measure kernel verification times for the different sections; the most expensive ones were tail_nodeport_nat_egress_ipv4 and tail_nodeport_nat_egress_ipv6. These sections are not actually used by bpf_lxc at all, yet they are still added and verified by the kernel. This change therefore removes the sections unused by bpf_lxc from the nodeport header. After this change the total verification time for all bpf_lxc programs drops by almost half (from ~500ms to ~270ms), and together with the rest of the program-loading routine these changes contribute even more to the overall latency drop.
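For context, a minimal self-contained sketch of the conditional tail-call declaration pattern this change relies on is shown below. The macro bodies are simplified re-derivations of the Linux kernel's IS_ENABLED() trick, not copies of Cilium's headers, and the stub program at the end is illustrative only.

/* Sketch only: simplified stand-ins for Cilium's is_defined(), __not() and
 * declare_tailcall_if() macros. */
#define ___cat(a, b) a##b
#define __cat(a, b) ___cat(a, b)

/* is_defined(X) evaluates to 1 when X is #defined to 1, otherwise to 0. */
#define __placeholder_1 0,
#define __take_2nd(ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_2nd(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__placeholder_##val)
#define is_defined(x) ___is_defined(x)

/* Preprocessor-level boolean negation. */
#define __not_0 1
#define __not_1 0
#define __not(x) __cat(__not_, x)

/* Attach a tail-call ELF section only when cond evaluates to 1; with cond == 0
 * the function is still compiled but never placed in a program section, so the
 * loader never hands it to the kernel verifier. */
#define __section_tail(id, key) __attribute__((section(#id "/" #key), used))
#define __declare_tail_if_0(key)
#define __declare_tail_if_1(key) __section_tail(CILIUM_MAP_CALLS, key)
#define declare_tailcall_if(cond, key) __cat(__declare_tail_if_, cond)(key)

/* Usage, mirroring the nodeport header: with IS_BPF_LXC defined (as bpf_lxc.c
 * now does), the condition below is 0 and no tail-call section is emitted. */
#define IS_BPF_LXC 1

struct __ctx_buff; /* stand-in for the real BPF context type */

declare_tailcall_if(__not(is_defined(IS_BPF_LXC)), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
int tail_nodeport_nat_ingress_ipv6(struct __ctx_buff *ctx)
{
	(void)ctx; /* NAT handling elided in this sketch */
	return 0;
}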

Test scenario
All tests were run on cluster version 1.24.4-gke.300. The Cilium master branch was used as the baseline for all tests.

To test the different configurations, a GKE sandbox with custom Dataplane V2 images was used. The cluster was created with 100 nodes and DPv2 enabled. For Pod Startup Latency measurements, a modified configuration of the GKE Scalability 100-node performance test was used.

At the beginning of each run, one pod was created on every node with the same image as in the Pod Startup Latency measurement, so image pull time does not affect the resulting latency. Then 500 deployments with 5 pods each were created, and the pod startup latency was measured once they were all running.

Results

Results were aggregated from 20 separate test runs.

Percentile Baseline Optimized
P50 3088 ms 2584 ms
P90 3701 ms 3169 ms
P99 4126 ms 3573 ms

Summary
This optimization reduces the total latency by ~500ms and removes the loading of unnecessary sections into the kernel.

Remove unused sections for bpf_lxc from nodeport.h

Signed-off-by: Alex Katsman <alexkats@google.com>

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 29, 2022
@alexkats alexkats marked this pull request as ready for review September 30, 2022 09:19
@alexkats alexkats requested a review from a team as a code owner September 30, 2022 09:19
@@ -571,7 +571,7 @@ int tail_nodeport_ipv6_dsr(struct __ctx_buff *ctx)
}
#endif /* ENABLE_DSR */

-__section_tail(CILIUM_MAP_CALLS, CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
+declare_tailcall_if(__not(is_defined(IS_BPF_LXC)), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
Contributor

I'm not sure if this would be better, but please consider using a positive condition, and if that's not possible, then maybe define the positive condition separately, e.g.

declare_tailcall_if(is_defined(X_ENABLED), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
declare_tailcall_if(is_defined(IS_BPF_ABC), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)

or

#define LOAD_NODEPORT_NAT_INGRESS (conditions)
declare_tailcall_if(is_defined(LOAD_NODEPORT_NAT_INGRESS), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)

Member

@YutaroHayakawa YutaroHayakawa left a comment

Hi, the change looks good to me. Please provide us with a release note and rebase, as it conflicts with the latest master.

https://docs.cilium.io/en/stable/contributing/development/contributing_guide/

@YutaroHayakawa YutaroHayakawa added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Oct 6, 2022
@alexkats
Author

alexkats commented Oct 6, 2022

Hi, thanks for the approval. I added a release note and resolved the conflicts. I believe this change should have a release-note/misc label (based on what I see in the contributing guide), but I'm not able to change labels.

@sayboras sayboras added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Oct 8, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 8, 2022
@sayboras sayboras added release-note/misc This PR makes changes that have no direct user impact. and removed dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Oct 8, 2022
@sayboras sayboras closed this Oct 8, 2022
@sayboras sayboras reopened this Oct 8, 2022
@sayboras
Member

sayboras commented Oct 8, 2022

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:


Test Name

K8sAgentPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath

Failure Output

FAIL: connectivity-check pods are not ready after timeout

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

One of the Kubernetes SLOs is the Pod Startup Latency SLO. The
current limit is a 99th percentile <= 5s. With Cilium in use,
Kubernetes gets very close to this limit, sometimes even exceeding it.
One of the major bottlenecks is eBPF program verification performed by
the kernel. These changes remove sections unused by bpf_lxc, which
reduces Pod Startup Latency by ~500ms.

Signed-off-by: Alex Katsman <alexkats@google.com>
@qmonnet
Member

qmonnet commented Oct 14, 2022

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 14, 2022
@qmonnet qmonnet merged commit b41954d into cilium:master Oct 14, 2022
@alexkats alexkats deleted the remove-unused-sections-from-lxc branch October 14, 2022 15:24
@alexkats
Author

Hi, thanks for your help getting this PR merged. I have a question: could this change be a candidate for a backport to 1.12?

@@ -11,6 +11,8 @@

#include <linux/icmpv6.h>

+#define IS_BPF_LXC 1
+
/* Controls the inclusion of the CILIUM_CALL_SRV6 section in the object file.
*/
#define SKIP_SRV6_HANDLING
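The added IS_BPF_LXC define is what flips those conditions in the nodeport header; roughly (a sketch, with the include path and comment simplified):

/* bpf_lxc.c (simplified): IS_BPF_LXC is defined before lib/nodeport.h is
 * included, so every declare_tailcall_if(__not(is_defined(IS_BPF_LXC)), ...)
 * in that header evaluates its condition to 0 and emits no tail-call section,
 * meaning those programs are neither loaded nor verified for bpf_lxc. */
#define IS_BPF_LXC 1
#include "lib/nodeport.h"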
Member

Why not use #ifndef SKIP_NODEPORT_NAT instead of declare_tailcall_if? As you can see above, we already use the first convention today.

Author

Hi, I saw both conventions being used, but since I only wanted to conditionally declare the tail calls, I decided it'd be more natural to use declare_tailcall_if instead of putting the whole code under a macro guard.
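For illustration, a rough sketch of the two alternatives (the guard name comes from the suggestion above, while the function body and helper name are hypothetical; the two definitions are alternatives, not meant to coexist):

/* Alternative A: hide the whole definition behind a preprocessor guard. */
#ifndef SKIP_NODEPORT_NAT
__section_tail(CILIUM_MAP_CALLS, CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
int tail_nodeport_nat_ingress_ipv6(struct __ctx_buff *ctx)
{
	return nodeport_nat_ingress_ipv6(ctx); /* hypothetical helper */
}
#endif /* SKIP_NODEPORT_NAT */

/* Alternative B (what this PR uses): always compile the function, but attach
 * the tail-call section, and hence load and verify the program, only when the
 * condition holds. */
declare_tailcall_if(__not(is_defined(IS_BPF_LXC)), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
int tail_nodeport_nat_ingress_ipv6(struct __ctx_buff *ctx)
{
	return nodeport_nat_ingress_ipv6(ctx); /* hypothetical helper */
}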

@joestringer
Member

This kind of change would typically not be a candidate for backporting based on the standard backporting criteria, as it is not a fix: https://docs.cilium.io/en/stable/contributing/release/backports/#backport-criteria

That said, if we think the risk is low but benefit is high here then it may be worth backporting to 1.12.x. I know that there are occasionally user-reported issues around the ability of Cilium to create pods at a certain rate, and this could have a meaningful impact on those users. This seems at least reasonable to raise as a discussion point during the next community meeting, I'll put it on the agenda.

@pchaigno
Member

Hi @alexkats,
We discussed this change and its backport in the community meeting today. It would help to have a bit more information on this bug (the broken SLO specifically) and how common it might be.

Specifically, could you share on which kernel(s) you made the measurements and with what Cilium configuration(s)?

With Cilium in use, Kubernetes gets very close to this limit and sometimes even exceeds it.

Under which conditions was it exceeding the limit? Your numbers show it's always <5s, but I'm guessing some of the runs exceeded 5s? Is that still happening, just less frequently, with this patch?

The Cilium master branch was used as the baseline for all tests. To test the different configurations, a GKE sandbox with custom Dataplane V2 images was used.

I'm also a bit confused by these two sentences. Did you test this with Cilium OSS or with GKE's patched Cilium version?

@alexkats
Author

Hi @pchaigno,

Thanks for the discussion.

I used GKE's patched Cilium version based on Cilium OSS master (that's why I stated that I used the master branch). The kernel version used was 5.10.109.

Actually, we aim for 3.5 seconds. We want to minimize the pod startup latency regression when switching from the kube-proxy based dataplane to Cilium on GKE. There is a k8s SLO of a 99th percentile per cluster-day <= 5s, so we have room to spend, but we want to use as little of it as possible.

I performed tests with different numbers of nodes in the cluster, specifically 100, 500, and 5k.

  • With 100 nodes it didn't go beyond 5 seconds, but it comes pretty close to the limit from time to time.
  • With 500 nodes the average P99 is still within 5 seconds, but it sometimes goes over and reaches around 5.5 seconds.
  • With 5k nodes the latency is ~6 seconds on average, and it quite often goes even higher.

With the change we move further away from the limit for 100 and 500 nodes, and get closer to 5 seconds for 5k nodes.

@pchaigno
Member

Hi @alexkats,
Thanks for the details!

It sounds like we still have a bug (SLO breakage) at scale. Could you open an issue so we can properly track that? We will also be able to refer to that issue as motivation when backporting the present pull request.

Did you collect Cilium metrics when running those benchmarks? It would be useful to understand where the additional pod creation latency is coming from (e.g., slow kube-apiserver at scale).

@alexkats
Author

alexkats commented Nov 7, 2022

Hi @pchaigno ,

I created the issue #22023.

Regarding metrics, I haven't really looked into them yet, but I plan to do so when running new tests.

@joestringer joestringer added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Nov 18, 2022