Problem statement
Currently, we recommend turning off node-to-node connectivity testing for larger clusters; from https://docs.cilium.io/en/stable/operations/performance/scalability/report/:
--set endpointHealthChecking.enabled=false and --set healthChecking=false disable endpoint health checking entirely.
However it is recommended that those features be enabled initially on a smaller cluster (3-10 nodes)
where it can be used to detect potential packet loss due to firewall rules or hypervisor settings.
I believe there are three reasons for that:
- By default, we export latency metrics to each node separately, i.e. one time series per remote node (illustrated in the sketch after this list). Example metric:
cilium_node_connectivity_latency_seconds{address_type="primary",protocol="http",source_cluster="kind-kind",source_node_name="kind-worker",target_cluster="kind-kind",target_node_ip="10.244.0.236",target_node_name="kind-control-plane",target_node_type="remote_intra_cluster",type="endpoint"} 0.000660872
This means that if you scrape Cilium metrics across the cluster, you get O(n^2) metric cardinality from this metric alone. Similarly, we export a connectivity status metric, for example:
cilium_node_connectivity_status{source_cluster="kind-kind",source_node_name="kind-worker",target_cluster="kind-kind",target_node_name="kind-control-plane",target_node_type="remote_intra_cluster",type="endpoint"} 1
which also includes source and destination node names, resulting in the same O(n^2) metric cardinality.
- We do not spread ICMP pings across time (source code), which means that we periodically burst a batch of pings at the same time.
- ProbeInterval is fixed to 60s:
cilium/cilium-health/launch/launcher.go, line 37 at d20f15e:
serverProbeInterval = 60 * time.Second
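To make the cardinality problem concrete, here is a minimal, hypothetical sketch using the Prometheus Go client (not the actual Cilium metrics code); the label set mirrors the example metric above, and the arithmetic shows how per-target labels multiply into O(n^2) series across the fleet:

```go
// A hypothetical sketch (not the actual Cilium metrics code) of how a per-target
// label set multiplies into O(n^2) series: every agent exports one series per
// remote node, so the fleet as a whole exports roughly n*(n-1) series.
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	latency := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "cilium_node_connectivity_latency_seconds",
		Help: "Last observed latency to each remote node (per-target labels).",
	}, []string{"source_node_name", "target_node_name", "protocol", "type"})
	prometheus.MustRegister(latency)

	// One child series per (source, target) pair, as in the example above.
	latency.WithLabelValues("kind-worker", "kind-control-plane", "http", "endpoint").Set(0.000660872)

	const nodes = 5000
	fmt.Printf("series per agent: %d, series fleet-wide: %d\n", nodes-1, nodes*(nodes-1))
	// Output: series per agent: 4999, series fleet-wide: 24995000
}
```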
New node-to-node connectivity health checking proposal
For problem 1, instead of having a node-to-node metric with O(n^2) cardinality, we could export histogram metrics without the destination node or destination cluster labels, keeping only "type":
cilium_node_connectivity_latency_bucket{source_node_name="kind-worker", source_cluster="kind-kind", type="node", le="0.005"} 0
cilium_node_connectivity_latency_bucket{source_node_name="kind-worker", source_cluster="kind-kind", type="node", le="0.01"} 0
...
A similar solution would apply to the cilium_node_connectivity_status metric; see the sketch below.
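As a rough illustration of the proposed shape, here is a minimal sketch with the Prometheus Go client; the metric names, label sets, and recordProbe helper are assumptions for illustration, not an existing Cilium implementation, and the status metric is shown as just one possible aggregated form (a counter of probe outcomes):

```go
// A sketch of the proposed aggregation (assumed names, not Cilium's actual code):
// record per-probe latency into a histogram keyed only by source and probe type,
// so the series count no longer depends on the number of target nodes.
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// Latency histogram without target_node_name / target_cluster labels.
	connLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "cilium_node_connectivity_latency_seconds",
		Help:    "Latency of node-to-node health probes, aggregated over targets.",
		Buckets: prometheus.DefBuckets, // 0.005s ... 10s
	}, []string{"source_cluster", "source_node_name", "type"})

	// One possible aggregated replacement for cilium_node_connectivity_status:
	// count probe outcomes by status instead of exporting a per-target gauge.
	connProbeResults = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cilium_node_connectivity_probe_results_total",
		Help: "Probe outcomes by status, aggregated over targets (hypothetical name).",
	}, []string{"source_cluster", "source_node_name", "type", "status"})
)

// recordProbe would be called once per completed probe.
func recordProbe(probeType string, latencySeconds float64, reachable bool) {
	connLatency.WithLabelValues("kind-kind", "kind-worker", probeType).Observe(latencySeconds)
	status := "reachable"
	if !reachable {
		status = "unreachable"
	}
	connProbeResults.WithLabelValues("kind-kind", "kind-worker", probeType, status).Inc()
}

func main() {
	prometheus.MustRegister(connLatency, connProbeResults)
	recordProbe("node", 0.0007, true)
	recordProbe("endpoint", 0.0012, true)
}
```

With this shape, the number of series each agent exports is fixed by the number of probe types and histogram buckets, independent of cluster or clustermesh size.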
Both of these metrics would provide a high-level overview of latency and connectivity status. If users notice an increase in latency or a change in connectivity status, they can take a more in-depth look by running:
cilium-dbg status --verbose
or
cilium-health status
or
cilium-health status --probe
on the affected node, which would still output the full matrix of destination nodes with latency and status information for each remote node.
For problems 2 and 3, instead of running probes every minute (bursty ICMP probes as well as HTTP probes, which are spread over time), let's introduce health checking with a (configurable) fixed QPS of probes, for example 5 QPS per probe type as the default (a sketch follows the list below). This would mean that:
- We have consistent probing overhead as the number of nodes in the cluster/clustermesh grows.
- Because we use a histogram for latencies, users will have to use rate(cilium_node_connectivity_latency_bucket...), which means they will only be observing the latency of fresh results, but for a subset of nodes at a time.
- The connectivity_status metric will contain "more stale" results as the cluster/clustermesh grows (*).
- Users can still trigger probing on demand with cilium-health status --probe to get a fresh full matrix of statuses/latencies, or use cilium-health status to check the currently reported metrics.
- If users want to trade off CPU/memory usage for "fresher" metrics, they can do so by increasing the fixed QPS of probes.
(*) For example, with 5,000 nodes and 5 QPS, a full pass takes 5,000 / 5 = 1,000 seconds, i.e. roughly 17 minutes, to update the full connectivity status.
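A minimal sketch of the fixed-QPS loop, assuming a simple token-bucket limiter from golang.org/x/time/rate; probeNode and runProber are placeholders for illustration, not a concrete Cilium patch:

```go
// A minimal sketch of the fixed-QPS idea (assumed names, not a concrete Cilium
// patch): walk the node list continuously and let a rate limiter cap the probe
// rate, so probing overhead stays constant as the cluster grows.
package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/time/rate"
)

// probeNode stands in for a single ICMP or HTTP health probe.
func probeNode(ctx context.Context, node string) (time.Duration, error) {
	start := time.Now()
	// ... send the probe and wait for the reply ...
	return time.Since(start), nil
}

func runProber(ctx context.Context, nodes []string, qps float64) {
	limiter := rate.NewLimiter(rate.Limit(qps), 1) // e.g. 5 probes per second
	for {
		for _, node := range nodes {
			if err := limiter.Wait(ctx); err != nil {
				return // context cancelled
			}
			latency, err := probeNode(ctx, node)
			if err != nil {
				log.Printf("probe to %s failed: %v", node, err)
				continue
			}
			// Feed the aggregated histogram/status metrics here.
			_ = latency
		}
		// With 5,000 nodes at 5 QPS, one full pass takes ~1,000s (~17 min),
		// which is the staleness trade-off described in the footnote above.
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	runProber(ctx, []string{"kind-control-plane", "kind-worker", "kind-worker2"}, 5)
}
```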
We could also consider distinguishing between local nodes and remote nodes in a clustermesh and probing them separately, while also exposing metrics with a label like in-cluster=true/false.
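If we go that route, the aggregated metrics could carry locality as a label; a minimal sketch with assumed names:

```go
// A possible extension of the sketches above (names assumed): tag the aggregated
// metrics with an in_cluster label so local-cluster and remote clustermesh probes
// can be observed separately, still without per-target labels.
package main

import "github.com/prometheus/client_golang/prometheus"

var probeLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "cilium_node_connectivity_latency_seconds",
	Help:    "Probe latency aggregated over targets, split by locality.",
	Buckets: prometheus.DefBuckets,
}, []string{"source_cluster", "source_node_name", "type", "in_cluster"})

func main() {
	prometheus.MustRegister(probeLatency)
	// A probe to a node in the local cluster.
	probeLatency.WithLabelValues("kind-kind", "kind-worker", "node", "true").Observe(0.0007)
	// A probe to a node in a remote clustermesh cluster.
	probeLatency.WithLabelValues("kind-kind", "kind-worker", "node", "false").Observe(0.0031)
}
```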