-
Notifications
You must be signed in to change notification settings - Fork 3.4k
metrics: reduce cardinality of nodeconnectivity metrics #33103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0e20555
to
3a934d5
Compare
aceafc2
to
cfb6d8f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc change looks good; I only glanced quickly over the rest.
I'd recommend adding the motivation for the change to your commit description and PR description. It's not clear why we reduce the cardinality when reading the commit log; we have to go and check the related issue. I'd rather have the motivation for the patch clearly mentioned here.
This comment was marked as outdated.
This comment was marked as outdated.
b8f1baa
to
5f549a9
Compare
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, only small nits :)
38b29b1
to
d81bff4
Compare
Force push for constant timeout diff:
|
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small comment, expect for that lgtm 🎉
This comment was marked as resolved.
This comment was marked as resolved.
86d2e79
to
6be5857
Compare
This comment was marked as resolved.
This comment was marked as resolved.
6be5857
to
72ff857
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one hopefully last blocking comment, I must have missed that on the previous review (sorry!)
4263608
to
cfb1970
Compare
/test |
ef1008d
to
5dd290a
Compare
/test |
1 similar comment
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, thanks for the changes 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks!
This PR deprecates the existing metrics `node_connectivity_latency_seconds` and `node_connectivity_status`, intending to eventually replace them with lower-cardinality metrics `node_health_connectivity_latency_seconds` and `node_health_connectivity_status`. Metric Description: `node_health_connectivity_latency_seconds` is a histogram metric which records all valid latencies observed by the cluster `node_health_connectivity_status` is a gauge metric that returns the number of nodes/endpoints in each status type. Explanation of Lower Cardinality: Prior to this, the original node connectivity metrics were exported with source node and target node information, with an O(n^2) cardinality. In the updated versions, labels related to target cluster & node are removed to provide a higher-level overview of latency/status information and improve the scalability of this metric. The metrics use only the labels: - source node name - source cluster - type - status (for node_connectivity_status) Other changes: The corresponding output for `cilium-health status --succinct` has also been modified to represent the # of reachable endpoints / # of total endpoints. Signed-off-by: jshr-w <shjayaraman@microsoft.com>
5dd290a
to
aa13a7b
Compare
/test |
This PR deprecates the existing metrics
node_connectivity_latency_seconds
andnode_connectivity_status
, intending to eventually replace them with lower-cardinality metricsnode_health_connectivity_latency_seconds
andnode_health_connectivity_status
.Metric Description:
node_health_connectivity_latency_seconds
is a histogram metric which records all valid latencies observed by the clusternode_health_connectivity_status
is a gauge metric that returns the number of nodes/endpoints in each status type.Explanation of Lower Cardinality:
Prior to this, the original node connectivity metrics were exported with source node and target node information, with an O(n^2) cardinality. In the updated versions, labels related to target cluster & node are removed to provide a higher-level overview of latency/status information and improve the scalability of this metric.
The metrics use only the labels:
node_connectivity_status
)Other changes:
cilium-health status --succinct
has also been modified to represent the # of reachable endpoints / # of total endpoints.Relates to issue #32820.
Example of metrics, single cluster with 2 nodes.
On start-up:

After deleting one node:

Example of metrics seen on a node in a clustermesh with 2 clusters, each with two nodes.
On start-up, before all endpoints come up:

Once all endpoints are ready:

After deleting one cluster:

`cilium-health status --succinct` output
Please ensure your pull request adheres to the following guidelines:
description and a
Fixes: #XXX
line if the commit addresses a particularGitHub issue.
Fixes: <commit-id>
tag, thenplease add the commit author[s] as reviewer[s] to this issue.