Cilium EndpointSlices: improve metrics from the Operator CES controller #40418
Conversation
/test
/test
Thanks! One initial comment inline. Could you please also extend the commit messages with information on the changes, so that context is also available there in addition to the PR description?
(Edit: integration test failures look legitimate)
Doc changes: Just one nit (see below), looks good to me otherwise. Thanks!
/test
/test
Looks good from my side, thanks!
/test
/test (linting problem, so re-running the tests again...)
Thanks! A couple more minor comments inline.
operator/pkg/ciliumendpointslice/testutils/no_op_workqueue_metrics_provider.go
LGTM for @cilium/operator, one question re. code ownership inline.
This modularizes the client-go Workqueue MetricsProvider setup so that it can be injected into any Cell leveraging Workqueues for easy monitoring. The buckets for histograms are also updated with more relevant values. Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
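For context, a minimal sketch of the client-go hook this commit builds on: workqueue.MetricsProvider is an interface the queue reports into, and a provider can be supplied when constructing a queue. The no-op implementation below (similar in spirit to the test helper touched in this PR) keeps the sketch short; a real provider would back these with Prometheus metrics. Names such as noOpProvider and the queue name are illustrative, and the exact constructor may differ depending on the vendored client-go version.

```go
package main

import (
	"k8s.io/client-go/util/workqueue"
)

// noOpMetric satisfies all of the metric interfaces the workqueue reports into.
type noOpMetric struct{}

func (noOpMetric) Inc()            {}
func (noOpMetric) Dec()            {}
func (noOpMetric) Set(float64)     {}
func (noOpMetric) Observe(float64) {}

// noOpProvider implements workqueue.MetricsProvider; a real provider would
// return Prometheus-backed implementations instead of no-ops.
type noOpProvider struct{}

func (noOpProvider) NewDepthMetric(string) workqueue.GaugeMetric            { return noOpMetric{} }
func (noOpProvider) NewAddsMetric(string) workqueue.CounterMetric           { return noOpMetric{} }
func (noOpProvider) NewLatencyMetric(string) workqueue.HistogramMetric      { return noOpMetric{} }
func (noOpProvider) NewWorkDurationMetric(string) workqueue.HistogramMetric { return noOpMetric{} }
func (noOpProvider) NewUnfinishedWorkSecondsMetric(string) workqueue.SettableGaugeMetric {
	return noOpMetric{}
}
func (noOpProvider) NewLongestRunningProcessorSecondsMetric(string) workqueue.SettableGaugeMetric {
	return noOpMetric{}
}
func (noOpProvider) NewRetriesMetric(string) workqueue.CounterMetric { return noOpMetric{} }

func main() {
	// The provider is handed to the queue at construction time; client-go then
	// reports depth, adds, latency, work duration, unfinished work, longest
	// running processor and retries through it.
	q := workqueue.NewRateLimitingQueueWithConfig(
		workqueue.DefaultControllerRateLimiter(),
		workqueue.RateLimitingQueueConfig{
			Name:            "example-ces-queue", // illustrative queue name
			MetricsProvider: noOpProvider{},
		},
	)
	defer q.ShutDown()
	q.Add("example-key")
}
```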
…queues Leverage the WorkqueueMetricsProvider in order to report useful metrics for queues inside the CES controller (depth, adds, latency, work duration, unfinished work, longest running processor, retries). Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
This metric used to use the default buckets {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}, which don't make sense here since they are far too small. Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
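For comparison, a minimal sketch of declaring a Prometheus histogram with explicit buckets instead of the defaults (prometheus.DefBuckets), which target sub-second request latencies rather than counts of CEP changes. The metric name and bucket values here are illustrative, not the ones used in the PR.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// cepChangesPerCES uses explicit buckets suited to counting CEP changes per
// CES update, rather than prometheus.DefBuckets. Name and bucket values are
// illustrative only.
var cepChangesPerCES = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_number_of_cep_changes_per_ces",
	Help:    "Number of CEP changes batched into a single CES update.",
	Buckets: []float64{1, 2, 4, 8, 16, 32, 64, 128, 256},
})

func main() {
	prometheus.MustRegister(cepChangesPerCES)
	cepChangesPerCES.Observe(3) // record a batch of 3 CEP changes
}
```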
/test
Sorry, was on vacation, now resuming work on this PR. I hopefully addressed the last comments.
When failures were reported by this metric, it was impossible to tell whether they were transient failures which will be retried or fatal errors (i.e. all retries exhausted). I also added a warning log for errors which are retried so that we can at least see them. Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
/test
		c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelOutcome, LabelValueOutcomeFail, LabelFailureType, LabelFailureTypeTransient).Inc()
	} else {
		c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelOutcome, LabelValueOutcomeFail, LabelFailureType, LabelFailureTypeFatal).Inc()
	}
} else {
	c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelValueOutcomeSuccess).Inc()
	c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelValueOutcomeSuccess, "").Inc()
Should the same metric be passed a different number of parameters?
Two other things to think about:
- How can we exercise such metrics to ensure that the codepaths are being hit in our test workflows?
- Apparently these calls can cause the entire binary to panic, which seems highly risky to me. I wonder if we could use alternate APIs that provide stronger guarantees around this (see the sketch below). It's making me a bit concerned in general about how we use metrics and whether we're mitigating the potential risks associated with panicking when metrics are incremented.
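For illustration only (this is not a change proposed in the PR): client_golang already ships a non-panicking variant of this API. CounterVec.GetMetricWithLabelValues returns an error on a label mismatch instead of panicking like WithLabelValues does. The metric name and labels below are made up for the sketch.

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// syncTotal declares two labels, so every increment must supply exactly two values.
var syncTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_sync_total",
		Help: "Example counter with an outcome and a failure_type label.",
	},
	[]string{"outcome", "failure_type"},
)

func incrementOutcome(outcome, failureType string) {
	// WithLabelValues(outcome) would panic here because only one value is
	// supplied for two labels; GetMetricWithLabelValues surfaces the mismatch
	// as an error instead.
	m, err := syncTotal.GetMetricWithLabelValues(outcome, failureType)
	if err != nil {
		log.Printf("failed to resolve metric labels: %v", err)
		return
	}
	m.Inc()
}

func main() {
	prometheus.MustRegister(syncTotal)
	incrementOutcome("success", "")
}
```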
cc @cilium/metrics for awareness and consideration of the above topics.
Should the same metric be passed a different number of parameters?
Good catch, this probably happened during one of the intermediate iterations, when switching back to using WithLabelValues. @antonipp Could you please take care of a follow-up fix?
Apparently these calls can cause the entire binary to panic, which seems highly risky to me.
Not a metrics person, but to me panicking seems fine, because it is basically a bug, and the earlier we catch it during the development cycle, the better (I'm kind of surprised that we haven't already seen it in CI 😕).
Here is the fix: #40817
panicking the cilium-agent due to a bug is not OK.
IMO it is appropriate to distinguish the type of bug. I totally agree with you that the agent should never panic because of e.g. missing validation, not checking error conditions, and things like that. But I personally think it still makes sense in these cases: it seems unrealistic to me to wrap all of these calls with if err != nil { log(err) }, just to end up catching them in CI in the very same way (but with much more verbose and hard-to-read code). To me, it doesn't look much different from MustParseIP("1.2.3.4") (with the string hardcoded obviously, not user-provided), where a typo could obviously slip through, but the likelihood is extremely low (arguably lower, but not totally impossible).
There's generally one place we accept panic, which is as part of config verification on startup.
There are ~370 occurrences of WithLabelValues (g grep WithLabelValues -- ':!vendor' ':!*_test.go' | wc -l), so there are at least two 😁. And a non-negligible number in vendored packages as well, such as controller-runtime. Joking aside, realistically there will always be panics, otherwise we'd have bug-free software (unless we wrap all goroutines with recover, but then we'd have other problems). And I personally struggle to see how checking an error that must not occur would help much in avoiding that (just as you can make a mistake writing the labels, you can forget to check the error).
We did see this in CI, but it was not reliable - most likely the code paths here are only sometimes hit.
Yes, I agree it was not fully reliable, but it still got caught in less than 12 hours, which seems fairly reasonable to me. It should ideally have been caught before merging the PR (either via reviews, and some blame is on me here, or via CI). If it were an error log, it would have ended up in exactly the same way.
I agree that doing error checking for all of these is cumbersome. This feels like something that we could statically analyze for. 🤔 I'd much prefer that over extra error checking boilerplate if it solves the same problem.
There are ~370 occurrences ...
This is the reason I got a bit nervous in this thread 😅; at least I had not been thinking about this failure condition. I think we've moved towards unconditionally registering all metrics on startup, which should help reduce the likelihood of hitting this, but metrics are inherently about conditionally incrementing some counters at runtime based on the logic of the program. So if we don't have static analyzers for the incrementing logic, then I think we need to be particularly vigilant that the test coverage catches this.
We did catch this relatively quickly, though it was not particularly reliable: it manifested in different ways across workflows; for instance, in some cases only the "features check" at the end of the run failed, because it tried to query something in the operator. I'd guess fewer than 1% of individual test job runs hit it. There are two properties I'd like out of our metrics programming patterns if we could: one, to fail in a consistent way, and two, to produce an understandable error condition when it fails. Any ideas we might have to improve these properties would be welcome.
though it was not particularly reliable: it manifested in different ways across workflows; for instance, in some cases only the "features check" at the end of the run failed, because it tried to query something in the operator.
Ah, yeah, I guess one aspect is that this metric was only incremented when reconciliation failed, so we actually needed a failure to trigger it. The other is that the check for operator restarts is currently disabled, as it proved to be flaky in CI (the operator can restart if it loses the leader election lease). We could probably make that logic a bit smarter and only ignore restarts caused by expected reasons.
I see. I didn't see a related issue so I filed #40858 to follow up on that case in particular.
I ended up writing a linter, and it looks like we've got other code that dynamically constructs the set of labels passed to the metrics counter functions, which further opens up the possibility that cilium-agent could randomly crash due to runtime state. I'll follow up separately regarding that.
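Purely as an illustration of what such a static check could look like (this is not the linter referenced above), a minimal go/analysis pass might flag WithLabelValues arguments that are not plain string literals. A real linter would also need to resolve named constants and compare the argument count against the metric's declared labels.

```go
package main

import (
	"go/ast"
	"go/token"

	"golang.org/x/tools/go/analysis"
	"golang.org/x/tools/go/analysis/passes/inspect"
	"golang.org/x/tools/go/analysis/singlechecker"
	"golang.org/x/tools/go/ast/inspector"
)

// Analyzer flags WithLabelValues arguments that are not plain string literals.
// It is deliberately naive: it does not resolve named constants and does not
// check the argument count against the metric's declared labels.
var Analyzer = &analysis.Analyzer{
	Name:     "metriclabels",
	Doc:      "flags non-literal label values passed to WithLabelValues",
	Requires: []*analysis.Analyzer{inspect.Analyzer},
	Run:      run,
}

func run(pass *analysis.Pass) (interface{}, error) {
	ins := pass.ResultOf[inspect.Analyzer].(*inspector.Inspector)
	ins.Preorder([]ast.Node{(*ast.CallExpr)(nil)}, func(n ast.Node) {
		call := n.(*ast.CallExpr)
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "WithLabelValues" {
			return
		}
		for _, arg := range call.Args {
			// String literals are safe; anything built at runtime could panic
			// the binary if it does not match the declared labels.
			if lit, ok := arg.(*ast.BasicLit); ok && lit.Kind == token.STRING {
				continue
			}
			pass.Reportf(arg.Pos(), "non-literal label value passed to WithLabelValues")
		}
	})
	return nil, nil
}

func main() {
	singlechecker.Main(Analyzer)
}
```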
Related PR: #40863
Description
This PR brings multiple improvements to the CES controller monitoring in the Operator.
Summary of changes:
- client-go's workqueue#MetricsProvider setup was extracted into a separate Cell. This modularizes the MetricsProvider setup so that it can be injected into any Operator Cell leveraging Workqueues for easy monitoring. I did not touch the Agent code since it's a much heavier lift and can potentially be done separately later on. I also updated the Histogram metrics emitted by the Provider so that their buckets have more relevant values.
- The MetricsProvider is attached to both WorkQueues used by the CES Controller to export a variety of metrics to Prometheus (depth, adds, latency, work duration, unfinished work, longest running processor, retries).
- The buckets of the number_of_cep_changes_per_ces metric were updated: it used to use the default buckets {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}, which really don't make sense here.
- The ces_sync_total metric now distinguishes failure types: previously, when failures were reported by this metric, it was impossible to tell whether they were transient failures which will be retried or fatal errors (i.e. all retries exhausted). I also added a warning log for errors which are retried so that we can at least see them (see the sketch after this list).
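For readers unfamiliar with the pattern, here is a minimal sketch of the transient-vs-fatal classification described in the last bullet. All names (reconciler, maxRetries, the metric and label values) are illustrative assumptions and do not reflect the controller's actual code.

```go
package main

import (
	"errors"
	"log/slog"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

// maxRetries is an illustrative retry budget, not the controller's actual value.
const maxRetries = 15

type reconciler struct {
	queue     workqueue.RateLimitingInterface
	syncTotal *prometheus.CounterVec // labels: outcome, failure_type
	logger    *slog.Logger
}

// handleErr classifies a sync failure as transient (still within the retry
// budget, will be requeued) or fatal (retries exhausted) before incrementing
// the counter, and logs transient failures so they remain visible.
func (r *reconciler) handleErr(err error, key string) {
	if err == nil {
		r.queue.Forget(key)
		r.syncTotal.WithLabelValues("success", "").Inc()
		return
	}
	if r.queue.NumRequeues(key) < maxRetries {
		r.logger.Warn("failed to sync CES, will retry", "key", key, "error", err)
		r.syncTotal.WithLabelValues("failure", "transient").Inc()
		r.queue.AddRateLimited(key)
		return
	}
	r.syncTotal.WithLabelValues("failure", "fatal").Inc()
	r.queue.Forget(key)
}

func main() {
	r := &reconciler{
		queue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
		syncTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{Name: "example_ces_sync_total", Help: "Example CES sync outcomes."},
			[]string{"outcome", "failure_type"},
		),
		logger: slog.Default(),
	}
	r.handleErr(errors.New("transient example"), "example-ces")
}
```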