@derailed derailed commented Oct 3, 2023

When a node is deleted from a cluster, the metrics associated with that node continue to be exported to Prometheus until the agent is restarted.
Instead of requiring a restart, we want to delete these metrics dynamically when a node is removed from the cluster.

This PR ensures that node_connectivity_status and node_connectivity_latency_seconds no longer report metrics for nodes that are no longer present in the cluster.
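
To illustrate the idea of dropping per-node series on node removal, here is a minimal sketch in Go. It is not the code from this PR: `nodeGauge`, `seriesKey`, `Set`, `DeleteNode`, and `Len` are hypothetical names, and the type is a standard-library-only stand-in for a labelled gauge vector. With the Prometheus Go client, a comparable effect may be achievable through vector methods such as `DeleteLabelValues` or `DeletePartialMatch`; whether this PR uses them is not stated here.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey identifies one exported series (hypothetical label set).
type seriesKey struct {
	node string // name of the remote node the series describes
	kind string // which measurement, e.g. "status" or "latency"
}

// nodeGauge is a hypothetical stand-in for a labelled gauge vector.
type nodeGauge struct {
	mu     sync.Mutex
	series map[seriesKey]float64
}

func newNodeGauge() *nodeGauge {
	return &nodeGauge{series: make(map[seriesKey]float64)}
}

// Set records the latest value for one (node, kind) series.
func (g *nodeGauge) Set(node, kind string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.series[seriesKey{node: node, kind: kind}] = v
}

// DeleteNode drops every series that refers to node and returns how many
// were removed. Calling it for an unknown node is a no-op.
func (g *nodeGauge) DeleteNode(node string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	removed := 0
	for k := range g.series {
		if k.node == node {
			delete(g.series, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// Len reports how many series are currently held.
func (g *nodeGauge) Len() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.series)
}

func main() {
	g := newNodeGauge()
	g.Set("node-a", "status", 1)
	g.Set("node-a", "latency", 0.002)
	g.Set("node-b", "status", 1)
	g.Set("node-b", "latency", 0.004)

	// Simulate a node-delete event: node-b leaves the cluster.
	removed := g.DeleteNode("node-b")
	fmt.Println("removed:", removed, "remaining:", g.Len())
}
```

The point of the sketch is that cleanup is driven by the node-delete event rather than by an agent restart, so series for departed nodes do not accumulate.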

Fixes: #issue-number

Cilium now properly deletes stale (deleted) nodes from the node_connectivity_status and node_connectivity_latency_seconds metrics, reducing metric cardinality.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 3, 2023

derailed commented Oct 3, 2023

/test

@derailed derailed marked this pull request as ready for review October 4, 2023 00:08
@derailed derailed requested review from a team as code owners October 4, 2023 00:08
@derailed derailed requested review from ldelossa and chancez October 4, 2023 00:08
…er reported.

When a node is deleted from a cluster, metrics associated with that node
are still being exported to prometheus. Short of restarting the agent,
we want to dynamically delete these metrics when a node is removed from the cluster.

This PR ensures node_connectivity_status and node_connectivity_latency
no longer report metrics for nodes that are no longer present on the
cluster.

Signed-off-by: Fernand Galiana <fernand.galiana@isovalent.com>
@derailed derailed force-pushed the issue/node_metrics_delete branch from 681c5d5 to ce9b145 Compare October 4, 2023 20:25

derailed commented Oct 4, 2023

/test

@christarazi christarazi added kind/bug This is a bug in the Cilium logic. kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. labels Oct 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 4, 2023

christarazi commented Oct 4, 2023

Just a minor nit on the release note

When a node is deleted from a cluster, we must ensure prometheus metrics associated with deleted node are no longer reported. Notably: node_connectivity_status and node_connectivity_latency_seconds.

Typically we frame release notes by describing the impact, so something like

Cilium now properly deletes stale (deleted) nodes from the node_connectivity_status and node_connectivity_latency_seconds metrics, reducing metric cardinality.

@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. area/health Relates to the cilium-health component labels Oct 4, 2023

@christarazi christarazi left a comment

AFAIU, there's already a differentiation for metrics that can be deleted, so this PR is just following the pattern. IMO, we can merge this now to fix the ongoing problems with the node connectivity metrics, and then follow up with a refactor.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 11, 2023

squeed commented Oct 12, 2023

Checkpatch is complaining that the commit subject line is too long. Overridden.

@squeed squeed merged commit e9f97cd into cilium:main Oct 12, 2023
@mikejennings

Did this make it into the newest release?

@derailed derailed added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 labels Nov 3, 2023
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Nov 8, 2023
@christarazi christarazi added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 labels Nov 8, 2023
@christarazi christarazi added the backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. label Nov 8, 2023
@github-actions github-actions bot added the backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. label Nov 8, 2023