SLO Breakage for Kubernetes Pod Startup Latency at scale #22023

@alexkats

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

One of the Kubernetes SLOs is the Pod Startup Latency SLO, which currently requires the 99th percentile to be <= 5s. With Cilium, Kubernetes comes very close to this limit and sometimes exceeds it.

All tests were performed on GKE's patched Cilium version based on Cilium OSS master. Here's the current status:

  • With 100 nodes, P99 did not exceed 5 seconds, but it occasionally comes close to the limit.
  • With 500 nodes, the average P99 is still within 5 seconds, but it sometimes exceeds the limit and reaches around 5.5 seconds.
  • With 5k nodes, P99 is ~6 seconds on average and quite often goes higher.
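For readers unfamiliar with how the SLO is evaluated, here is a minimal sketch of a P99 check against the 5s limit. It is not the tooling used for the tests above: the function names are hypothetical, the nearest-rank percentile method is an assumption, and the sample latencies are synthetic rather than measured.

```python
import math

# Limit taken from the SLO quoted above: P99 pod startup latency <= 5s.
SLO_THRESHOLD_SECONDS = 5.0


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (method is an assumption)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_startup_slo(latencies_seconds, threshold=SLO_THRESHOLD_SECONDS):
    """Return True when the 99th percentile is within the threshold."""
    return percentile(latencies_seconds, 99) <= threshold


# Synthetic data only: 98 fast pod startups plus two slow outliers.
synthetic = [1.0] * 98 + [5.5, 6.0]
print(percentile(synthetic, 99))
print(meets_startup_slo(synthetic))
```

With only 100 samples, two slow pod startups are enough to push P99 over the limit, which is why occasional outliers matter at this percentile.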

The situation improved with change #21505, but latency problems remain.

Cilium Version

master

Kernel Version

5.10.109

Kubernetes Version

1.25.2-gke.800

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees

No one assigned

Labels

  • area/agent: Cilium agent related.
  • kind/bug: This is a bug in the Cilium logic.
  • need-more-info: More information is required to further debug or fix the issue.
  • pinned: These issues are not marked stale by our issue bot.
  • sig/scalability: Impacts how well Cilium handles a high rate of events or churn.
