-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Fix cilium dgb health v1.16 #37463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix cilium dgb health v1.16 #37463
Conversation
[ upstream commit fee6107 ] This PR deprecates the existing metrics `node_connectivity_latency_seconds` and `node_connectivity_status`, intending to eventually replace them with lower-cardinality metrics `node_health_connectivity_latency_seconds` and `node_health_connectivity_status`. Metric Description: `node_health_connectivity_latency_seconds` is a histogram metric which records all valid latencies observed by the cluster `node_health_connectivity_status` is a gauge metric that returns the number of nodes/endpoints in each status type. Explanation of Lower Cardinality: Prior to this, the original node connectivity metrics were exported with source node and target node information, with an O(n^2) cardinality. In the updated versions, labels related to target cluster & node are removed to provide a higher-level overview of latency/status information and improve the scalability of this metric. The metrics use only the labels: - source node name - source cluster - type - status (for node_connectivity_status) Other changes: The corresponding output for `cilium-health status --succinct` has also been modified to represent the # of reachable endpoints / # of total endpoints. Signed-off-by: jshr-w <shjayaraman@microsoft.com> Signed-off-by: darox <maderdario@gmail.com>
[ upstream commit bf7bc43 ] This commit fixes several issues that were reported in the status commands of cilium-dbg. - Fix the issue where not up to 10 unreachable nodes were shown in the status command. - Fix the issue where `--all-nodes` didn't return all nodes regardless of their status. - `--all-health` now returns the health of all modules regardless of their status. This can be paired with `--all-nodes` to get the health of all nodes. - `--verbose` now shows the latency to nodes and endpoints. Signed-off-by: darox <maderdario@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to also fix main? Usually we change main first and then backport to older branches.
I didn't review in depth, just probing the high level question of how much we think is appropriate to change/fix in v1.16 now that v1.17.0 is out. There was previously some discussions on metrics process here: #33995, and the backport criteria is here: https://docs.cilium.io/en/stable/contributing/release/backports/#id1
Deprecated Metrics | ||
~~~~~~~~~~~~~~~~~~ | ||
* ``cilium_node_connectivity_status`` is now deprecated. Please use ``cilium_node_health_connectivity_status`` instead. | ||
* ``cilium_node_connectivity_latency_seconds`` is now deprecated. Please use ``cilium_node_health_connectivity_latency_seconds`` instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment: I think it's a bit surprising to users to deprecate some metrics six months after release.
I will close this one. See discussion in #33972 (comment) |
Please ensure your pull request adheres to the following guidelines:
description and a
Fixes: #XXX
line if the commit addresses a particularGitHub issue.
Fixes: <commit-id>
tag, thenplease add the commit author[s] as reviewer[s] to this issue.
Fixes: #issue-number