Skip to content

Conversation

darox
Copy link
Contributor

@darox darox commented Feb 6, 2025

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Fixes: #issue-number

<!-- Enter the release note text here if needed or remove this section! -->

jshr-w and others added 2 commits February 6, 2025 09:37
[ upstream commit fee6107 ]

This PR deprecates the existing metrics `node_connectivity_latency_seconds` and `node_connectivity_status`, intending to eventually replace them with lower-cardinality metrics `node_health_connectivity_latency_seconds` and `node_health_connectivity_status`.

Metric Description:

`node_health_connectivity_latency_seconds` is a histogram metric which records all valid latencies observed by the cluster
`node_health_connectivity_status` is a gauge metric that returns the number of nodes/endpoints in each status type.

Explanation of Lower Cardinality:
Prior to this, the original node connectivity metrics were exported with source node and target node information, with an O(n^2) cardinality. In the updated versions, labels related to target cluster & node are removed to provide a higher-level overview of latency/status information and improve the scalability of this metric.

The metrics use only the labels:

- source node name
- source cluster
- type
- status (for node_connectivity_status)

Other changes:

The corresponding output for `cilium-health status --succinct` has also been modified to represent the # of reachable endpoints / # of total endpoints.

Signed-off-by: jshr-w <shjayaraman@microsoft.com>
Signed-off-by: darox <maderdario@gmail.com>
[ upstream commit bf7bc43 ]

This commit fixes several issues that were
reported in the status commands of cilium-dbg.

- Fix the issue where not up to 10 unreachable
  nodes were shown in the status command.
- Fix the issue where `--all-nodes` didn't
  return all nodes regardless of their status.
- `--all-health` now returns the health of all
  modules regardless of their status. This can be
  paired with `--all-nodes` to get the health of
  all nodes.
- `--verbose` now shows the latency to nodes and
  endpoints.

Signed-off-by: darox <maderdario@gmail.com>
@darox darox requested a review from a team as a code owner February 6, 2025 09:58
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.16 This PR represents a backport for Cilium 1.16.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels Feb 6, 2025
@darox darox marked this pull request as draft February 6, 2025 09:59
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to also fix main? Usually we change main first and then backport to older branches.

I didn't review in depth, just probing the high level question of how much we think is appropriate to change/fix in v1.16 now that v1.17.0 is out. There was previously some discussions on metrics process here: #33995, and the backport criteria is here: https://docs.cilium.io/en/stable/contributing/release/backports/#id1

Deprecated Metrics
~~~~~~~~~~~~~~~~~~
* ``cilium_node_connectivity_status`` is now deprecated. Please use ``cilium_node_health_connectivity_status`` instead.
* ``cilium_node_connectivity_latency_seconds`` is now deprecated. Please use ``cilium_node_health_connectivity_latency_seconds`` instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment: I think it's a bit surprising to users to deprecate some metrics six months after release.

@darox
Copy link
Contributor Author

darox commented Feb 7, 2025

Do we need to also fix main? Usually we change main first and then backport to older branches.

I didn't review in depth, just probing the high level question of how much we think is appropriate to change/fix in v1.16 now that v1.17.0 is out. There was previously some discussions on metrics process here: #33995, and the backport criteria is here: https://docs.cilium.io/en/stable/contributing/release/backports/#id1

I will close this one. See discussion in #33972 (comment)

@darox darox closed this Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.16 This PR represents a backport for Cilium 1.16.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants