
CFP: Scalable node-to-node connectivity health-checking #32820

@marseel

Description


Problem statement

Currently, we recommend turning off node-to-node connectivity testing for larger clusters, from https://docs.cilium.io/en/stable/operations/performance/scalability/report/ :

--set endpointHealthChecking.enabled=false and --set healthChecking=false disable endpoint health checking entirely. However it is recommended that those features be enabled initially on a smaller cluster (3-10 nodes) where it can be used to detect potential packet loss due to firewall rules or hypervisor settings.

I believe there are three reasons for that:

  1. By default, we export latency metrics for each destination node separately, for example:
cilium_node_connectivity_latency_seconds{address_type="primary",protocol="http",source_cluster="kind-kind",source_node_name="kind-worker",target_cluster="kind-kind",target_node_ip="10.244.0.236",target_node_name="kind-control-plane",target_node_type="remote_intra_cluster",type="endpoint"} 0.000660872

which means that if you scrape Cilium metrics, you get O(n^2) metric cardinality from this metric alone (every agent exports one series per peer node, so a 1,000-node cluster produces on the order of 1,000,000 series across all agents).

Similarly, we have a connectivity status metric, for example:

cilium_node_connectivity_status{source_cluster="kind-kind",source_node_name="kind-worker",target_cluster="kind-kind",target_node_name="kind-control-plane",target_node_type="remote_intra_cluster",type="endpoint"} 1

that also includes source and destination node names, resulting in similar O(n^2) metrics cardinality.

  2. We do not spread ICMP pings across time (see the source code), which means we periodically burst a batch of pings at the same time.

  3. ProbeInterval is fixed to 60s:

    serverProbeInterval = 60 * time.Second

New node-to-node connectivity health checking proposal

For problem 1, instead of a node-to-node metric with O(n^2) cardinality, we could export histogram metrics without the destination node or destination cluster labels, keeping only "type":

cilium_node_connectivity_latency_bucket{source_node_name="kind-worker", source_cluster="kind-kind", type="node", le="0.005"} 0
cilium_node_connectivity_latency_bucket{source_node_name="kind-worker", source_cluster="kind-kind", type="node", le="0.01"} 0
...

A similar solution would apply to the cilium_node_connectivity_status metric.
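
As a rough sketch of what the agent side could look like (assuming the prometheus client_golang library and a hypothetical observeProbe helper; the metric and label names below just mirror the example above and are not an agreed API):

    // Minimal sketch, not Cilium's actual metrics code: a latency histogram keyed
    // only by source cluster, source node, and probe type.
    package health

    import "github.com/prometheus/client_golang/prometheus"

    var nodeConnectivityLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Namespace: "cilium",
            Name:      "node_connectivity_latency",
            Help:      "Histogram of node-to-node probe latencies in seconds.",
            Buckets:   prometheus.DefBuckets, // 0.005s ... 10s
        },
        []string{"source_cluster", "source_node_name", "type"},
    )

    func init() {
        prometheus.MustRegister(nodeConnectivityLatency)
    }

    // observeProbe is a hypothetical helper recording a single probe result.
    func observeProbe(sourceCluster, sourceNode, probeType string, latencySeconds float64) {
        nodeConnectivityLatency.
            WithLabelValues(sourceCluster, sourceNode, probeType).
            Observe(latencySeconds)
    }

With only these three labels, each agent exports a constant number of series per probe type regardless of cluster size, and percentiles can still be derived with histogram_quantile() over rate() of the buckets.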

Both of these metrics would provide a high-level overview of latencies and connectivity status. If a user notices an increase in latency or a change in connectivity status, they can take a more in-depth look by running:

cilium-dbg status --verbose
or
cilium-health status 
or
cilium-health status --probe

on the affected node, which would still output a full matrix of destination nodes with latency and status information for each remote node.

For problems 2 and 3, instead of running probes every minute (ICMP probes in a burst and HTTP probes spread over time), let's introduce health checking with a (configurable) fixed QPS of probes, for example a default of 5 QPS per probe type (see the sketch after this list). This would mean that:

  • Probe overhead stays constant as the number of nodes in the cluster/clustermesh grows
  • Because we use a histogram for latencies, users will have to use rate(cilium_node_connectivity_latency_bucket...), which means they will only be observing the latency of fresh results, but for a subset of nodes at a time
  • The connectivity_status metric will contain "more stale" results as the cluster/clustermesh grows (*)
  • Users can still trigger probing on demand with cilium-health status --probe to get a full "fresh" matrix of statuses/latencies, or use cilium-health status to check the currently reported results.
  • If users want to trade off CPU/memory usage for "fresher" metrics, they can do so by increasing the fixed QPS of probes.

(*) For example, with 5,000 nodes and 5 QPS, it would take approximately 17 minutes (5,000 probes at 5 QPS ≈ 1,000 seconds) to refresh the full connectivity status.
We could also consider distinguishing between local nodes and remote nodes in a clustermesh and probing them separately, while also exposing metrics with a label like in-cluster=true/false.
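
As a minimal sketch of the fixed-QPS idea (assuming a hypothetical probeNode helper and the golang.org/x/time/rate limiter; names and structure are illustrative, not the actual prober):

    // Illustrative only: walk the node list continuously, never exceeding probeQPS.
    package health

    import (
        "context"
        "time"

        "golang.org/x/time/rate"
    )

    const probeQPS = 5 // configurable; trades CPU/network overhead for metric freshness

    // probeNode is a hypothetical helper performing one ICMP or HTTP probe.
    func probeNode(ctx context.Context, nodeIP string) (time.Duration, error) {
        // ... send the probe and measure round-trip time ...
        return 0, nil
    }

    func runProber(ctx context.Context, listNodes func() []string) {
        limiter := rate.NewLimiter(rate.Limit(probeQPS), 1)
        for ctx.Err() == nil {
            for _, ip := range listNodes() { // refresh the node list on every pass
                if err := limiter.Wait(ctx); err != nil {
                    return // context cancelled
                }
                if latency, err := probeNode(ctx, ip); err == nil {
                    _ = latency // record into the histogram / status metrics
                }
            }
        }
    }

Because the limiter caps the probe rate rather than fixing the probe interval, per-node overhead stays flat as the cluster grows, at the cost of results becoming staler on larger clusters, which is exactly the trade-off described above.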

Labels

area/agent (Cilium agent related), area/health (Relates to the cilium-health component), area/metrics (Impacts statistics / metrics gathering, e.g. via Prometheus), kind/cfp (Cilium Feature Proposal), kind/feature (This introduces new functionality), sig/scalability (Impacts how well Cilium handles a high rate of events or churn)
