
CFP: Scalable node-to-node connectivity health-checking #32820

@marseel

Description


Problem statement

Currently, we recommend turning off node-to-node connectivity testing for larger clusters, from https://docs.cilium.io/en/stable/operations/performance/scalability/report/ :

--set endpointHealthChecking.enabled=false and --set healthChecking=false disable endpoint health checking entirely. However it is recommended that those features be enabled initially on a smaller cluster (3-10 nodes) where it can be used to detect potential packet loss due to firewall rules or hypervisor settings.

I believe there are three reasons for that:

  1. By default, we export latency metrics for each destination node separately, for example:
cilium_node_connectivity_latency_seconds{address_type="primary",protocol="http",source_cluster="kind-kind",source_node_name="kind-worker",target_cluster="kind-kind",target_node_ip="10.244.0.236",target_node_name="kind-control-plane",target_node_type="remote_intra_cluster",type="endpoint"} 0.000660872

which means that if you scrape Cilium metrics, you get O(n^2) metric cardinality from this metric alone (every agent exports one series per peer node, so a 1,000-node cluster produces on the order of 1,000,000 series across all agents).

Similarly, we have a connectivity status metric, for example:

cilium_node_connectivity_status{source_cluster="kind-kind",source_node_name="kind-worker",target_cluster="kind-kind",target_node_name="kind-control-plane",target_node_type="remote_intra_cluster",type="endpoint"} 1

that also includes source and destination node names, resulting in similar O(n^2) metrics cardinality.

  2. We do not spread ICMP pings across time (see the source code), which means we periodically burst a batch of pings at the same time.

  3. ProbeInterval is fixed to 60s:

    serverProbeInterval = 60 * time.Second

New node-to-node connectivity health checking proposal

For problem 1, instead of a node-to-node metric with O(n^2) cardinality, we could export histogram metrics without the destination node or destination cluster labels, keeping only "type":

cilium_node_connectivity_latency_bucket{source_node_name="kind-worker", source_cluster="kind-kind", type="node", le="0.005"} 0
cilium_node_connectivity_latency_bucket{source_node_name="kind-worker", source_cluster="kind-kind", type="node", le="0.01"} 0
...

A similar solution would apply to the cilium_node_connectivity_status metric.
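
As a rough sketch of what the agent side could look like (assuming the prometheus client_golang library and a hypothetical observeProbe helper; the metric and label names below just mirror the example above and are not an agreed API):

    // Minimal sketch, not Cilium's actual metrics code: a latency histogram keyed
    // only by source cluster, source node, and probe type.
    package health

    import "github.com/prometheus/client_golang/prometheus"

    var nodeConnectivityLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Namespace: "cilium",
            Name:      "node_connectivity_latency",
            Help:      "Histogram of node-to-node probe latencies in seconds.",
            Buckets:   prometheus.DefBuckets, // 0.005s ... 10s
        },
        []string{"source_cluster", "source_node_name", "type"},
    )

    func init() {
        prometheus.MustRegister(nodeConnectivityLatency)
    }

    // observeProbe is a hypothetical helper recording a single probe result.
    func observeProbe(sourceCluster, sourceNode, probeType string, latencySeconds float64) {
        nodeConnectivityLatency.
            WithLabelValues(sourceCluster, sourceNode, probeType).
            Observe(latencySeconds)
    }

With only these three labels, each agent exports a constant number of series per probe type regardless of cluster size, and percentiles can still be derived with histogram_quantile() over rate() of the buckets.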

Both of these metrics would provide a high-level overview of latencies and connectivity status. If a user notices an increase in latency or a change in connectivity status, they can take a more in-depth look by running:

cilium-dbg status --verbose
or
cilium-health status 
or
cilium-health status --probe

on the affected node, which would still output a full matrix of destination nodes with latency and status information for each remote node.

For problems 2 and 3, instead of running probes every minute (ICMP probes in a burst and HTTP probes spread over time), let's introduce health checking with a (configurable) fixed QPS of probes, for example a default of 5 QPS per probe type (see the sketch after this list). This would mean that:

  • Probe overhead stays constant as the number of nodes in the cluster/clustermesh grows
  • Because we use a histogram for latencies, users will have to use rate(cilium_node_connectivity_latency_bucket...), which means they will only be observing the latency of fresh results, but for a subset of nodes at a time
  • The connectivity_status metric will contain "more stale" results as the cluster/clustermesh grows (*)
  • Users can still trigger probing on demand with cilium-health status --probe to get a full "fresh" matrix of statuses/latencies, or use cilium-health status to check the currently reported results.
  • If users want to trade off CPU/memory usage for "fresher" metrics, they can do so by increasing the fixed QPS of probes.

(*) For example, with 5,000 nodes and 5 QPS, it would take approximately 17 minutes (5,000 probes at 5 QPS ≈ 1,000 seconds) to refresh the full connectivity status.
We could also consider distinguishing between local nodes and remote nodes in a clustermesh and probing them separately, while also exposing metrics with a label like in-cluster=true/false.
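
As a minimal sketch of the fixed-QPS idea (assuming a hypothetical probeNode helper and the golang.org/x/time/rate limiter; names and structure are illustrative, not the actual prober):

    // Illustrative only: walk the node list continuously, never exceeding probeQPS.
    package health

    import (
        "context"
        "time"

        "golang.org/x/time/rate"
    )

    const probeQPS = 5 // configurable; trades CPU/network overhead for metric freshness

    // probeNode is a hypothetical helper performing one ICMP or HTTP probe.
    func probeNode(ctx context.Context, nodeIP string) (time.Duration, error) {
        // ... send the probe and measure round-trip time ...
        return 0, nil
    }

    func runProber(ctx context.Context, listNodes func() []string) {
        limiter := rate.NewLimiter(rate.Limit(probeQPS), 1)
        for ctx.Err() == nil {
            for _, ip := range listNodes() { // refresh the node list on every pass
                if err := limiter.Wait(ctx); err != nil {
                    return // context cancelled
                }
                if latency, err := probeNode(ctx, ip); err == nil {
                    _ = latency // record into the histogram / status metrics
                }
            }
        }
    }

Because the limiter caps the probe rate rather than fixing the probe interval, per-node overhead stays flat as the cluster grows, at the cost of results becoming staler on larger clusters, which is exactly the trade-off described above.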

Labels

area/agent (Cilium agent related), area/health (Relates to the cilium-health component), area/metrics (Impacts statistics / metrics gathering, e.g. via Prometheus), kind/cfp (Cilium Feature Proposal), kind/feature (This introduces new functionality), sig/scalability (Impacts how well Cilium handles a high rate of events or churn)
