
Conversation


@alexkats alexkats commented Sep 29, 2022

Background
One of the Kubernetes SLOs is the Pod Startup Latency SLO. The current limit is a 99th percentile <= 5s. With Cilium in use, Kubernetes gets very close to this limit and sometimes even exceeds it. One of the major bottlenecks is eBPF program verification performed by the kernel.

Change description
I went through the code to measure kernel verification times for the different sections; the most expensive ones were tail_nodeport_nat_egress_ipv4 and tail_nodeport_nat_egress_ipv6. These sections are not actually used by bpf_lxc at all, yet they are still added and verified by the kernel. This change therefore removes the sections unused by bpf_lxc from the nodeport header. After this change the total verification time for all bpf_lxc programs drops by almost half (from ~500ms to ~270ms), and together with the rest of the program-loading routine these changes contribute even more to the overall latency drop.
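For context, a minimal self-contained sketch of the conditional tail-call declaration pattern this change relies on is shown below. The macro bodies are simplified re-derivations of the Linux kernel's IS_ENABLED() trick, not copies of Cilium's headers, and the stub program at the end is illustrative only.

/* Sketch only: simplified stand-ins for Cilium's is_defined(), __not() and
 * declare_tailcall_if() macros. */
#define ___cat(a, b) a##b
#define __cat(a, b) ___cat(a, b)

/* is_defined(X) evaluates to 1 when X is #defined to 1, otherwise to 0. */
#define __placeholder_1 0,
#define __take_2nd(ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_2nd(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__placeholder_##val)
#define is_defined(x) ___is_defined(x)

/* Preprocessor-level boolean negation. */
#define __not_0 1
#define __not_1 0
#define __not(x) __cat(__not_, x)

/* Attach a tail-call ELF section only when cond evaluates to 1; with cond == 0
 * the function is still compiled but never placed in a program section, so the
 * loader never hands it to the kernel verifier. */
#define __section_tail(id, key) __attribute__((section(#id "/" #key), used))
#define __declare_tail_if_0(key)
#define __declare_tail_if_1(key) __section_tail(CILIUM_MAP_CALLS, key)
#define declare_tailcall_if(cond, key) __cat(__declare_tail_if_, cond)(key)

/* Usage, mirroring the nodeport header: with IS_BPF_LXC defined (as bpf_lxc.c
 * now does), the condition below is 0 and no tail-call section is emitted. */
#define IS_BPF_LXC 1

struct __ctx_buff; /* stand-in for the real BPF context type */

declare_tailcall_if(__not(is_defined(IS_BPF_LXC)), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
int tail_nodeport_nat_ingress_ipv6(struct __ctx_buff *ctx)
{
	(void)ctx; /* NAT handling elided in this sketch */
	return 0;
}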

Test scenario
All tests were run on cluster version 1.24.4-gke.300. The Cilium master branch was used as the baseline for all tests.

To test the different configurations, a GKE sandbox with custom Dataplane V2 images was used. The cluster was created with 100 nodes and DPv2 enabled. For Pod Startup Latency measurements, a modified configuration of the GKE Scalability 100-node performance test was used.

At the beginning of each run, one pod was created on every node with the same image as in the Pod Startup Latency measurement, so image pull time does not affect the resulting latency. Then 500 deployments with 5 pods each were created, and the pod startup latency was measured once they were all running.

Results

Results were aggregated from 20 separate test runs.

Percentile Baseline Optimized
P50 3088 ms 2584 ms
P90 3701 ms 3169 ms
P99 4126 ms 3573 ms

Summary
This optimization reduces the total latency by ~500ms and removes the loading of unnecessary sections into the kernel.

Remove unused sections for bpf_lxc from nodeport.h

Signed-off-by: Alex Katsman <alexkats@google.com>

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 29, 2022
@alexkats alexkats marked this pull request as ready for review September 30, 2022 09:19
@alexkats alexkats requested a review from a team as a code owner September 30, 2022 09:19
@@ -571,7 +571,7 @@ int tail_nodeport_ipv6_dsr(struct __ctx_buff *ctx)
}
#endif /* ENABLE_DSR */

-__section_tail(CILIUM_MAP_CALLS, CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
+declare_tailcall_if(__not(is_defined(IS_BPF_LXC)), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
Contributor

I'm not sure if this would be better, but please consider using a positive condition, and if that's not possible, then maybe define the positive condition separately, e.g.

declare_tailcall_if(is_defined(X_ENABLED), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
declare_tailcall_if(is_defined(IS_BPF_ABC), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)

or

#define LOAD_NODEPORT_NAT_INGRESS (conditions)
declare_tailcall_if(is_defined(LOAD_NODEPORT_NAT_INGRESS), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)

Member

@YutaroHayakawa YutaroHayakawa left a comment

Hi, the change looks good to me. Please provide us with a release note and rebase, as it conflicts with the latest master.

https://docs.cilium.io/en/stable/contributing/development/contributing_guide/

@YutaroHayakawa YutaroHayakawa added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Oct 6, 2022
@alexkats
Author

alexkats commented Oct 6, 2022

Hi, thanks for the approval. I added a release note and resolved the conflicts. I believe this change should have a release-note/misc label (based on what I see in the contributing guide), but I'm not able to change labels.

@sayboras sayboras added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Oct 8, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 8, 2022
@sayboras sayboras added release-note/misc This PR makes changes that have no direct user impact. and removed dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Oct 8, 2022
@sayboras sayboras closed this Oct 8, 2022
@sayboras sayboras reopened this Oct 8, 2022
@sayboras
Member

sayboras commented Oct 8, 2022

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed:


Test Name

K8sAgentPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath

Failure Output

FAIL: connectivity-check pods are not ready after timeout

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.9 so I can create one.

One of the Kubernetes SLOs is the Pod Startup Latency SLO. The
current limit is a 99th percentile <= 5s. With Cilium in use,
Kubernetes gets very close to this limit, sometimes even exceeding it.
One of the major bottlenecks is eBPF program verification performed by
the kernel. These changes remove sections unused by bpf_lxc, which
reduces Pod Startup Latency by ~500ms.

Signed-off-by: Alex Katsman <alexkats@google.com>
@qmonnet
Member

qmonnet commented Oct 14, 2022

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 14, 2022
@qmonnet qmonnet merged commit b41954d into cilium:master Oct 14, 2022
@alexkats alexkats deleted the remove-unused-sections-from-lxc branch October 14, 2022 15:24
@alexkats
Author

Hi, thanks for your help getting this PR merged. I have a question: could this change be a candidate for a backport to 1.12?

@@ -11,6 +11,8 @@

#include <linux/icmpv6.h>

+#define IS_BPF_LXC 1
+
/* Controls the inclusion of the CILIUM_CALL_SRV6 section in the object file.
*/
#define SKIP_SRV6_HANDLING
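The added IS_BPF_LXC define is what flips those conditions in the nodeport header; roughly (a sketch, with the include path and comment simplified):

/* bpf_lxc.c (simplified): IS_BPF_LXC is defined before lib/nodeport.h is
 * included, so every declare_tailcall_if(__not(is_defined(IS_BPF_LXC)), ...)
 * in that header evaluates its condition to 0 and emits no tail-call section,
 * meaning those programs are neither loaded nor verified for bpf_lxc. */
#define IS_BPF_LXC 1
#include "lib/nodeport.h"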
Member

Why not use #ifndef SKIP_NODEPORT_NAT instead of declare_tailcall_if? As you can see above, we already use the first convention today.

Author

Hi, I saw both conventions being used, but since I only wanted to conditionally declare the tail calls, I decided it'd be more natural to use declare_tailcall_if instead of putting the whole code under a macro guard.
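For illustration, a rough sketch of the two alternatives (the guard name comes from the suggestion above, while the function body and helper name are hypothetical; the two definitions are alternatives, not meant to coexist):

/* Alternative A: hide the whole definition behind a preprocessor guard. */
#ifndef SKIP_NODEPORT_NAT
__section_tail(CILIUM_MAP_CALLS, CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
int tail_nodeport_nat_ingress_ipv6(struct __ctx_buff *ctx)
{
	return nodeport_nat_ingress_ipv6(ctx); /* hypothetical helper */
}
#endif /* SKIP_NODEPORT_NAT */

/* Alternative B (what this PR uses): always compile the function, but attach
 * the tail-call section, and hence load and verify the program, only when the
 * condition holds. */
declare_tailcall_if(__not(is_defined(IS_BPF_LXC)), CILIUM_CALL_IPV6_NODEPORT_NAT_INGRESS)
int tail_nodeport_nat_ingress_ipv6(struct __ctx_buff *ctx)
{
	return nodeport_nat_ingress_ipv6(ctx); /* hypothetical helper */
}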

@joestringer
Member

This kind of change would typically not be a candidate for backporting based on the standard backporting criteria, as it is not a fix: https://docs.cilium.io/en/stable/contributing/release/backports/#backport-criteria

That said, if we think the risk is low but benefit is high here then it may be worth backporting to 1.12.x. I know that there are occasionally user-reported issues around the ability of Cilium to create pods at a certain rate, and this could have a meaningful impact on those users. This seems at least reasonable to raise as a discussion point during the next community meeting, I'll put it on the agenda.

@pchaigno
Member

Hi @alexkats,
We discussed this change and its backport in the community meeting today. It would help to have a bit more information on this bug (the broken SLO specifically) and how common it might be.

Specifically, could you share on which kernel(s) you made the measurements and with what Cilium configuration(s)?

With Cilium in use, Kubernetes gets very close to this limit and sometimes even exceeds it.

Under which conditions was it exceeding the limit? Your numbers show it's always <5s, but I'm guessing some of the runs exceeded 5s? Is that still happening, just less frequently, with this patch?

The Cilium master branch was used as the baseline for all tests. To test the different configurations, a GKE sandbox with custom Dataplane V2 images was used.

I'm also a bit confused by these two sentences. Did you test this with Cilium OSS or with GKE's patched Cilium version?

@alexkats
Author

Hi @pchaigno,

Thanks for the discussion.

I used GKE's patched Cilium version based on Cilium OSS master (that's why I stated that I used the master branch). The kernel version used was 5.10.109.

Actually, we aim for 3.5 seconds. We want to minimize the pod startup latency regression when switching from the kube-proxy based dataplane to Cilium on GKE. There is a k8s SLO of a 99th percentile per cluster-day <= 5s, so we have room to spend, but we want to use as little of it as possible.

I performed tests with different numbers of nodes in the cluster, specifically 100, 500, and 5k.

  • With 100 nodes it didn't go beyond 5 seconds, but it comes pretty close to the limit from time to time.
  • With 500 nodes the average P99 is still within 5 seconds, but it sometimes goes over and reaches around 5.5 seconds.
  • With 5k nodes the latency is ~6 seconds on average, and it quite often goes even higher.

With the change we move further away from the limit for 100 and 500 nodes, and get closer to 5 seconds for 5k nodes.

@pchaigno
Member

Hi @alexkats,
Thanks for the details!

It sounds like we still have a bug (SLO breakage) at scale. Could you open an issue so we can properly track that? We will also be able to refer to that issue as motivation when backporting the present pull request.

Did you collect Cilium metrics when running those benchmarks? It would be useful to understand where the additional pod creation latency is coming from (e.g., slow kube-apiserver at scale).

@alexkats
Author

alexkats commented Nov 7, 2022

Hi @pchaigno ,

I created the issue #22023.

Regarding metrics, I haven't really looked into them yet, but I plan to do so when running new tests.

@joestringer joestringer added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Nov 18, 2022