
Improve Cilium-agent Readiness/Liveness checks #25288

@marseel

Currently, cilium-agent becomes ready even when it is not yet able to handle new pods.
Consider the case where Cilium is running with a kvstore backend.

Cilium becomes ready before it has connected to the kvstore, because the readiness check tests an error that is nil by default. Depending on how Cilium is configured, this has different results:

  • If Cilium is unable to connect to the kvstore, the node still becomes ready and pods' Identities/Endpoints are created as CRDs and propagated
  • If the CiliumEndpoint CRD is disabled, the node becomes ready and pods' endpoints are not propagated
  • If identityAllocation is set to kvstore, the node becomes ready but pods are stuck in ContainerCreating status

We should:

  • Improve readiness in kvstore mode so that it takes the connection to the kvstore into account
  • Wait for the initial range requests to the kvstore to be executed and processed

Side effects:

  • A rolling upgrade of cilium-agent will take more time, but network connectivity will be more reliable during the upgrade and the load on the kvstore will be lower
  • Pods won't be able to start running during initialization unless they tolerate the taint (especially important when etcd is running within the cluster on the pod network)
  • cilium-agent will stay in the startupProbe phase during initialization (today it switches to ready instantly and starts performing readiness/liveness probes). This gives us more flexibility in configuring the time allowed for initialization, while reducing the time to restart once liveness/readiness starts failing.

Labels: area/agent, kind/enhancement, stale
