Skip to content

Conversation

tommyp1ckles
Copy link
Contributor

@tommyp1ckles tommyp1ckles commented Aug 2, 2025

The way we registered the k8s global work-queue metrics has always been prone to breaking, as it depended on the specific ordering of evaluation of init functions, as is explained in this comment:

// The client-go Register function can be called only once,
// but there's a possibility that another package calls it and
// registers the client-go metrics on its own registry.
// The metrics of one who called the init first will take effect,
// and which one wins depends on the order of package initialization.
// Currently, controller-runtime also calls it in its init function
// because there's an indirect dependency on controller-runtime.
// Given the possibility that controller-runtime wins, we should set
// the adapters directly to override the metrics registration of
// controller-runtime as well as calling Register function.
// Without calling Register function here, controller-runtime will have
// a chance to override our metrics registration.

This was already causing problems in the Operator resulting in missing metrics when trying to move the watchers/metrics.go file to a separate package. Clearly trying to trick Go into evaluating our init funcs before a code libraries wasn't a good approach moving forward.

As a result, recent changes have moved operator related workqueue configs to explicitly define the provider via a global variable. This fixes the problem of the init() registry race condition but it results in some Operator workqueue metrics now being prefixed with the agent metrics namespace (i.e. cilium_). Previously this was broken as well as because metrics registered via the global registry were not registered with the cilium prefix many such metrics were missing the cilium_ prefix, leading to inconsistent metric naming.

This follows up those changes and takes another step in cleaning all this up by removing the global MetricsProvider and explicitly passing a Hive provided one to each workqueue (i.e. notably, many of the resource watchers use these to setup K8s object events). The result is a wide refactor requiring a lot of fixed tests and addition of MetricsProvider to a lot of Hive constructor parameters - thus the separate pull request.

https://github.com/cilium/cilium/pull/40884/files#diff-8b8e02278a63139306486adc431761b5b2dd9847df5559f95acd90d9a921fb6bL44-L49

Fix operator k8s workqueue metrics to use correct prefix of cilium_operator_workqueue_<metric> 

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 2, 2025
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/remove-global-metrics-provider branch from 9e4402e to acf821d Compare August 8, 2025 21:47
@tommyp1ckles tommyp1ckles changed the title Pr/tp/remove global metrics provider Remove global metrics provider and unify workqueue metric naming Aug 8, 2025
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/remove-global-metrics-provider branch from acf821d to e481aa5 Compare August 15, 2025 03:56
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/remove-global-metrics-provider branch from e481aa5 to 1a7e1ed Compare August 15, 2025 03:59
@tommyp1ckles tommyp1ckles marked this pull request as ready for review August 15, 2025 03:59
@tommyp1ckles tommyp1ckles requested review from a team as code owners August 15, 2025 03:59
@tommyp1ckles tommyp1ckles added the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Aug 20, 2025
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/remove-global-metrics-provider branch from 1a7e1ed to 7526fee Compare August 20, 2025 01:02
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles
Copy link
Contributor Author

@bimmlerd

One question - would it make sense to add the metrics provider cell to cilium's pkg/hive? That would I think replace the second commit for all tests which don't use upstream hive.

Yeah, great idea, changed it to this and was able to drop all the test refactoring.

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/remove-global-metrics-provider branch from 7526fee to aaa1cab Compare August 20, 2025 01:04
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Aug 20, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 20, 2025
@tommyp1ckles tommyp1ckles added area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/operator Impacts the cilium-operator component dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 20, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 20, 2025
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Aug 20, 2025
This global provider was always problematic:

* Caused operator metrics to use the Agent cilium_* namespace.
* Global register was brittle and relied on implicit ordering
	of init() functions.

This removes the global provider and instead opts to provide
one via hive.

As well, any k8s watcher workqueue configs without a explicit
config (i.e. relying on the global RegisterProvider) will now
use the provided provider.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
These are now fixed to always be under the correct namespace.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/remove-global-metrics-provider branch from aaa1cab to 98d3391 Compare August 20, 2025 18:53
@tommyp1ckles
Copy link
Contributor Author

/test

Copy link
Member

@aditighag aditighag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for my codeowners.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 28, 2025
@tommyp1ckles tommyp1ckles enabled auto-merge August 28, 2025 16:53
@tommyp1ckles tommyp1ckles added this pull request to the merge queue Aug 28, 2025
Merged via the queue into cilium:main with commit 2162062 Aug 28, 2025
67 of 68 checks passed
@tommyp1ckles tommyp1ckles deleted the pr/tp/remove-global-metrics-provider branch August 28, 2025 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/operator Impacts the cilium-operator component ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants