Skip to content

metrics: reduce cardinality of nodeconnectivity metrics #33103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 26, 2024

Conversation

jshr-w
Copy link
Contributor

@jshr-w jshr-w commented Jun 12, 2024

This PR deprecates the existing metrics node_connectivity_latency_seconds and node_connectivity_status, intending to eventually replace them with lower-cardinality metrics node_health_connectivity_latency_seconds and node_health_connectivity_status.

Metric Description:

  1. node_health_connectivity_latency_seconds is a histogram metric which records all valid latencies observed by the cluster
  2. node_health_connectivity_status is a gauge metric that returns the number of nodes/endpoints in each status type.

Explanation of Lower Cardinality:
Prior to this, the original node connectivity metrics were exported with source node and target node information, with an O(n^2) cardinality. In the updated versions, labels related to target cluster & node are removed to provide a higher-level overview of latency/status information and improve the scalability of this metric.

The metrics use only the labels:

  • source node name
  • source cluster
  • type
  • status (for node_connectivity_status)

Other changes:

  • The corresponding output for cilium-health status --succinct has also been modified to represent the # of reachable endpoints / # of total endpoints.

Relates to issue #32820.

Example of metrics, single cluster with 2 nodes.

On start-up:
Screenshot 2024-08-22 192728

After deleting one node:
Screenshot 2024-08-22 192833

Example of metrics seen on a node in a clustermesh with 2 clusters, each with two nodes.

On start-up, before all endpoints come up:
Screenshot 2024-08-22 194902

Once all endpoints are ready:
Screenshot 2024-08-22 194907

After deleting one cluster:
Screenshot 2024-08-22 195445

`cilium-health status --succinct` output

Screenshot 2024-08-16 164928

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!
metrics: deprecate node_connectivity metrics in favor of new lower-cardinality node_health_connectivity metrics

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 12, 2024
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jun 12, 2024
@jshr-w jshr-w force-pushed the jshr/healthmetriccardinality branch from 0e20555 to 3a934d5 Compare June 12, 2024 19:04
@jshr-w jshr-w force-pushed the jshr/healthmetriccardinality branch 4 times, most recently from aceafc2 to cfb6d8f Compare June 20, 2024 01:56
@jshr-w jshr-w marked this pull request as ready for review June 20, 2024 02:01
@jshr-w jshr-w requested review from a team as code owners June 20, 2024 02:01
@qmonnet qmonnet added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jun 20, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 20, 2024
Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc change looks good; I only glanced quickly over the rest.

I'd recommend adding the motivation for the change to your commit description and PR description. It's not clear why we reduce the cardinality when reading the commit log; we have to go and check the related issue. I'd rather have the motivation for the patch clearly mentioned here.

@qmonnet

This comment was marked as outdated.

@jshr-w jshr-w force-pushed the jshr/healthmetriccardinality branch 2 times, most recently from b8f1baa to 5f549a9 Compare June 20, 2024 19:36
@qmonnet qmonnet added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jun 20, 2024
@qmonnet

This comment was marked as outdated.

Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, only small nits :)

@jshr-w jshr-w force-pushed the jshr/healthmetriccardinality branch 2 times, most recently from 38b29b1 to d81bff4 Compare June 26, 2024 15:45
@jshr-w jshr-w requested a review from marseel June 26, 2024 15:47
@jshr-w
Copy link
Contributor Author

jshr-w commented Aug 22, 2024

Force push for constant timeout diff:

diff --git a/pkg/health/server/server.go b/pkg/health/server/server.go
index b4764fca53..595bc6a79d 100644
--- a/pkg/health/server/server.go
+++ b/pkg/health/server/server.go
`@@ -70,6 +70,10 @@ type Server struct {
        localStatus  *healthModels.SelfStatus
 }
 
+const (
+       probeTimeout = 60
+)
+
 // DumpUptime returns the time that this server has been running.
 func (s *Server) DumpUptime() string {
        return time.Since(s.startTime).String()
@@ -317,7 +321,7 @@ func collectConnectivityMetric(status *healthModels.ConnectivityStatus, labels .
                        metrics.NodeConnectivityLatency.WithLabelValues(labels...).Set(metricValue)
                        metrics.NodeHealthConnectivityLatency.WithLabelValues(healthLabels...).Observe(metricValue)
                } else {
-                       metrics.NodeHealthConnectivityLatency.WithLabelValues(healthLabels...).Observe(60)
+                       metrics.NodeHealthConnectivityLatency.WithLabelValues(healthLabels...).Observe(probeTimeout)
                }
        } else {
                metrics.NodeConnectivityLatency.WithLabelValues(labels...).Set(-1)

@pchaigno
Copy link
Member

/test

Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment, expect for that lgtm 🎉

@maintainer-s-little-helper

This comment was marked as resolved.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 23, 2024
@jshr-w jshr-w force-pushed the jshr/healthmetriccardinality branch from 86d2e79 to 6be5857 Compare August 23, 2024 03:00
@maintainer-s-little-helper

This comment was marked as resolved.

@jshr-w jshr-w force-pushed the jshr/healthmetriccardinality branch from 6be5857 to 72ff857 Compare August 23, 2024 03:02
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 23, 2024
@jshr-w jshr-w requested a review from marseel August 23, 2024 03:02
@jshr-w
Copy link
Contributor Author

jshr-w commented Aug 23, 2024

/test

Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one hopefully last blocking comment, I must have missed that on the previous review (sorry!)

@jshr-w jshr-w force-pushed the jshr/healthmetriccardinality branch 2 times, most recently from 4263608 to cfb1970 Compare August 23, 2024 22:19
@jshr-w
Copy link
Contributor Author

jshr-w commented Aug 23, 2024

/test

@jshr-w jshr-w force-pushed the jshr/healthmetriccardinality branch from ef1008d to 5dd290a Compare August 24, 2024 23:29
@jshr-w
Copy link
Contributor Author

jshr-w commented Aug 24, 2024

/test

1 similar comment
@jshr-w
Copy link
Contributor Author

jshr-w commented Aug 25, 2024

/test

Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, thanks for the changes 🙏

Copy link
Contributor

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

This PR deprecates the existing metrics `node_connectivity_latency_seconds` and `node_connectivity_status`, intending to eventually replace them with lower-cardinality metrics `node_health_connectivity_latency_seconds` and `node_health_connectivity_status`.

Metric Description:

`node_health_connectivity_latency_seconds` is a histogram metric which records all valid latencies observed by the cluster
`node_health_connectivity_status` is a gauge metric that returns the number of nodes/endpoints in each status type.

Explanation of Lower Cardinality:
Prior to this, the original node connectivity metrics were exported with source node and target node information, with an O(n^2) cardinality. In the updated versions, labels related to target cluster & node are removed to provide a higher-level overview of latency/status information and improve the scalability of this metric.

The metrics use only the labels:

- source node name
- source cluster
- type
- status (for node_connectivity_status)

Other changes:

The corresponding output for `cilium-health status --succinct` has also been modified to represent the # of reachable endpoints / # of total endpoints.

Signed-off-by: jshr-w <shjayaraman@microsoft.com>
@jshr-w jshr-w force-pushed the jshr/healthmetriccardinality branch from 5dd290a to aa13a7b Compare August 26, 2024 14:29
@jshr-w
Copy link
Contributor Author

jshr-w commented Aug 26, 2024

/test

@pchaigno pchaigno enabled auto-merge August 26, 2024 18:05
@pchaigno pchaigno added this pull request to the merge queue Aug 26, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 26, 2024
Merged via the queue into cilium:main with commit fee6107 Aug 26, 2024
68 checks passed
@christarazi christarazi added the affects/v1.16 This issue affects v1.16 branch label Jan 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.16 This issue affects v1.16 branch kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants