Conversation

antonipp
Contributor

@antonipp antonipp commented Jul 8, 2025

Description

This PR brings multiple improvements to the CES controller monitoring in the Operator.

Summary of changes:

  • Extract client-go's workqueue#MetricsProvider into a separate Cell - this modularizes the MetricsProvider setup so that it can be injected into any Operator Cell leveraging Workqueues for easy monitoring (see the interface sketch after this list). I did not touch the Agent code since it's a much heavier lift and can potentially be done separately later on. I also updated the Histogram metrics emitted by the Provider so that their buckets have more relevant values.
  • Add a MetricsProvider to both WorkQueues used by the CES Controller to export a variety of metrics to Prometheus (depth, adds, latency, work duration, unfinished work, longest running processor, retries)
  • Better histogram buckets for the number_of_cep_changes_per_ces metric. It used to use the default buckets {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}, which make little sense for a count of endpoint changes.
  • Better labels for the ces_sync_total metric: when failures were reported by this metric, it was impossible to tell whether they were transient failures that would be retried or fatal errors (i.e. all retries exhausted). I also added a warning log for errors that are retried so that we can at least see them.
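
For reference, the interface provided by the extracted Cell is client-go's workqueue.MetricsProvider (from k8s.io/client-go/util/workqueue); its methods map one-to-one onto the metrics listed above:

// client-go's workqueue MetricsProvider: one constructor per queue metric.
type MetricsProvider interface {
	NewDepthMetric(name string) GaugeMetric
	NewAddsMetric(name string) CounterMetric
	NewLatencyMetric(name string) HistogramMetric
	NewWorkDurationMetric(name string) HistogramMetric
	NewUnfinishedWorkSecondsMetric(name string) SettableGaugeMetric
	NewLongestRunningProcessorSecondsMetric(name string) SettableGaugeMetric
	NewRetriesMetric(name string) CounterMetric
}
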
Cilium EndpointSlices: improve metrics from the Operator CES controller

@antonipp antonipp requested review from a team as code owners July 8, 2025 16:47
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 8, 2025
@antonipp
Contributor Author

antonipp commented Jul 8, 2025

/test

@antonipp
Contributor Author

antonipp commented Jul 9, 2025

/test

Member

@giorio94 giorio94 left a comment

Thanks! One initial comment inline. Could you please also extend the commit messages with information on the changes, so that context is also available there in addition to the PR description?

(Edit: integration test failures look legitimate)

@giorio94 giorio94 added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. feature/ces Impacts the Cilium Endpoint Slice logic. labels Jul 9, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 9, 2025
Member

@qmonnet qmonnet left a comment

Doc changes: Just one nit (see below), looks good to me otherwise. Thanks!

@antonipp antonipp requested a review from a team as a code owner July 9, 2025 14:15
@antonipp antonipp requested a review from tklauser July 9, 2025 14:15
@antonipp
Contributor Author

antonipp commented Jul 9, 2025

/test

@antonipp
Contributor Author

antonipp commented Jul 9, 2025

/test

Member

@qmonnet qmonnet left a comment

Looks good from my side, thanks!

@qmonnet
Member

qmonnet commented Jul 9, 2025

/test
(Woops I thought the tests were not running but now I think I may have restarted them 😬 So sorry!)

@antonipp
Contributor Author

antonipp commented Jul 9, 2025

/test

(linting problem, so re-running the tests again...)

@giorio94 giorio94 self-requested a review July 11, 2025 07:29
Member

@giorio94 giorio94 left a comment

Thanks! A couple more minor comments inline.

@squeed squeed requested review from gandro and removed request for nathanjsweet July 11, 2025 09:37
Member

@tklauser tklauser left a comment

LGTM for @cilium/operator, one question re. code ownership inline.

antonipp added 3 commits July 28, 2025 10:50
This modularizes the client-go Workqueue MetricsProvider setup so that it can be injected into any Cell leveraging Workqueues for easy monitoring.
The buckets for histograms are also updated with more relevant values.

Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
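
As an illustration of what the extracted Cell can look like, here is a rough sketch of a Prometheus-backed provider wired into a Hive module; the module name, metric names, buckets, and constructor dependencies are hypothetical and not the actual code from this commit.

package workqueuemetrics

import (
	"github.com/cilium/hive/cell"
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

// Cell provides a client-go workqueue.MetricsProvider that registers its
// metrics with the Prometheus registerer available in the dependency graph.
var Cell = cell.Module(
	"workqueue-metrics-provider",
	"client-go workqueue MetricsProvider for Operator workqueues",
	cell.Provide(newProvider),
)

type provider struct{ reg prometheus.Registerer }

func newProvider(reg prometheus.Registerer) workqueue.MetricsProvider {
	return provider{reg: reg}
}

// Buckets in seconds sized for queue wait and work durations (1ms to ~16s),
// instead of the default latency buckets.
var durationBuckets = prometheus.ExponentialBuckets(0.001, 4, 8)

func (p provider) gauge(queue, name string) prometheus.Gauge {
	g := prometheus.NewGauge(prometheus.GaugeOpts{Subsystem: queue, Name: name})
	p.reg.MustRegister(g)
	return g
}

func (p provider) counter(queue, name string) prometheus.Counter {
	c := prometheus.NewCounter(prometheus.CounterOpts{Subsystem: queue, Name: name})
	p.reg.MustRegister(c)
	return c
}

func (p provider) histogram(queue, name string) prometheus.Histogram {
	h := prometheus.NewHistogram(prometheus.HistogramOpts{Subsystem: queue, Name: name, Buckets: durationBuckets})
	p.reg.MustRegister(h)
	return h
}

func (p provider) NewDepthMetric(q string) workqueue.GaugeMetric       { return p.gauge(q, "depth") }
func (p provider) NewAddsMetric(q string) workqueue.CounterMetric      { return p.counter(q, "adds_total") }
func (p provider) NewLatencyMetric(q string) workqueue.HistogramMetric { return p.histogram(q, "queue_duration_seconds") }
func (p provider) NewWorkDurationMetric(q string) workqueue.HistogramMetric {
	return p.histogram(q, "work_duration_seconds")
}
func (p provider) NewUnfinishedWorkSecondsMetric(q string) workqueue.SettableGaugeMetric {
	return p.gauge(q, "unfinished_work_seconds")
}
func (p provider) NewLongestRunningProcessorSecondsMetric(q string) workqueue.SettableGaugeMetric {
	return p.gauge(q, "longest_running_processor_seconds")
}
func (p provider) NewRetriesMetric(q string) workqueue.CounterMetric { return p.counter(q, "retries_total") }
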
…queues

Leverage the WorkqueueMetricsProvider in order to report useful metrics for queues inside the CES controller (depth, adds, latency, work duration, unfinished work, longest running processor, retries)

Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
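
The wiring into a queue could then look roughly like this; the queue name and constructor are illustrative, not necessarily how the CES controller builds its queues.

import "k8s.io/client-go/util/workqueue"

// newCESQueue builds a rate-limited workqueue whose metrics are emitted
// through the injected MetricsProvider.
func newCESQueue(provider workqueue.MetricsProvider) workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueueWithConfig(
		workqueue.DefaultControllerRateLimiter(),
		workqueue.RateLimitingQueueConfig{
			Name:            "cilium_endpoint_slice", // hypothetical queue name
			MetricsProvider: provider,                // injected from the new Cell
		},
	)
}
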
This metric used to use the default buckets {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}, which make little sense here since they are far too small for a count of endpoint changes.

Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
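
A hedged sketch of count-oriented buckets for this metric; the namespace, help text, and exact bucket boundaries are illustrative.

import "github.com/prometheus/client_golang/prometheus"

// The default buckets (.005 .. 10) are sized for latencies in seconds; for a
// count of CEP changes per CES, exponential count buckets are a better fit.
var cepChangesPerCES = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "cilium_operator",
	Name:      "number_of_cep_changes_per_ces",
	Help:      "Number of CiliumEndpoint changes per CiliumEndpointSlice sync",
	Buckets:   prometheus.ExponentialBuckets(1, 2, 8), // 1, 2, 4, ..., 128
})
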
@antonipp antonipp requested a review from a team as a code owner July 28, 2025 08:50
@antonipp antonipp requested a review from aanm July 28, 2025 08:50
@antonipp
Contributor Author

/test

@antonipp
Contributor Author

Sorry, was on vacation, now resuming work on this PR. I hopefully addressed the last comments.

antonipp added 2 commits July 28, 2025 13:32
When failures were reported by this metric, it was impossible to tell whether they were transient failures that would be retried or fatal errors (i.e. all retries exhausted).
I also added a warning log for errors that are retried so that we can at least see them.

Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
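
A rough sketch of the outcome/failure-type labeling and retry warning described above; label values, constants, and function names are hypothetical, and every path passes the same number of label values to the counter vector.

import (
	"log/slog"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

// handleSyncResult records the outcome of a CES sync and decides whether to retry.
func handleSyncResult(queue workqueue.RateLimitingInterface, syncTotal *prometheus.CounterVec,
	logger *slog.Logger, key string, err error, maxRetries int) {
	if err == nil {
		syncTotal.WithLabelValues("success", "").Inc()
		queue.Forget(key)
		return
	}
	if queue.NumRequeues(key) < maxRetries {
		// Transient failure: it will be retried, but log it so it is not silently swallowed.
		logger.Warn("CES sync failed, will retry", "key", key, "error", err)
		syncTotal.WithLabelValues("fail", "transient").Inc()
		queue.AddRateLimited(key)
		return
	}
	// Fatal failure: retries exhausted, give up on this key.
	syncTotal.WithLabelValues("fail", "fatal").Inc()
	queue.Forget(key)
}
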
@antonipp
Contributor Author

/test

@giorio94 giorio94 enabled auto-merge July 28, 2025 12:58
@giorio94 giorio94 added this pull request to the merge queue Jul 29, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 29, 2025
Merged via the queue into cilium:main with commit a2ee607 Jul 29, 2025
68 checks passed
Comment on lines +318 to +323
c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelOutcome, LabelValueOutcomeFail, LabelFailureType, LabelFailureTypeTransient).Inc()
} else {
c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelOutcome, LabelValueOutcomeFail, LabelFailureType, LabelFailureTypeFatal).Inc()
}
} else {
c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelValueOutcomeSuccess).Inc()
c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelValueOutcomeSuccess, "").Inc()
Member

Should the same metric be passed a different count of parameters?

Member

Two other things to think about:

  1. How can we exercise such metrics to ensure that the codepaths are being hit in our test workflows?
  2. Apparently these calls can cause the entire binary to panic, which seems highly risky to me. I wonder if we could use alternate APIs that provide stronger guarantees around this (one error-returning alternative is sketched below). It's making me a bit concerned in general about how we use metrics and whether we're mitigating the potential risks associated with panicking when metrics are incremented.
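
For context on the second point, the Prometheus client itself exposes an error-returning variant: CounterVec.GetMetricWithLabelValues returns an error instead of panicking when the supplied label values do not match the vector's declared labels. A minimal, purely illustrative wrapper:

import (
	"log/slog"

	"github.com/prometheus/client_golang/prometheus"
)

// incWithLabels increments a counter vector, logging instead of panicking if
// the label values cannot be resolved.
func incWithLabels(vec *prometheus.CounterVec, logger *slog.Logger, values ...string) {
	c, err := vec.GetMetricWithLabelValues(values...)
	if err != nil {
		logger.Error("failed to resolve metric labels", "error", err)
		return
	}
	c.Inc()
}
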

Member

cc @cilium/metrics for awareness and consideration of the above topics.

Member

Should the same metric be passed a different count of parameters?

Good catch, probably happened during one of the intermediate iterations, when switching back to using WithLabelValues. @antonipp Could you please take care of a follow-up fix?

Apparently these calls can cause the entire binary to panic, which seems highly risky to me.

Not a metrics person, but to me panicking seems fine, because it is basically a bug, and the earlier we catch it during the development cycle, the better (I'm kind of surprised that we haven't already seen it in CI 😕).

Contributor Author

Here is the fix: #40817

Member

@giorio94 giorio94 Jul 30, 2025

panicking the cilium-agent due to a bug is not OK.

IMO it is appropriate to distinguish the type of bug. I totally agree with you that the agent should never panic because of, e.g., missing validation, unchecked error conditions, and things like that. But I personally think it still makes sense in these cases: realistically, it seems unrealistic to me to wrap all of these calls with if err != nil { log(err) }, just to end up catching them in CI in the very same way (but with way more verbose and hard-to-read code). To me, it doesn't look much different from MustParseIP("1.2.3.4") (with the string hardcoded obviously, not user-provided), where a typo could obviously slip through, but the likelihood is extremely low (arguably lower, but not totally impossible).

There's generally one place we accept panic, which is as part of config verification on startup.

There are ~370 occurrences of WithLabelValues (git grep WithLabelValues -- ':!vendor' ':!*_test.go' | wc -l), so there are at least two 😁. And a non-negligible number in vendored packages as well, such as controller-runtime. Joking aside, realistically there will always be panics, otherwise we'd have bug-free software (unless we wrap all goroutines with recover, but then we'd have other problems). And I personally struggle to see how checking an error that must not occur would help much in avoiding that (just as you can make a mistake writing the labels, you can forget to check the error).

We did see this in CI, but it was not reliable - most likely the code paths here are only sometimes hit.

Yes, I agree it was not fully reliable, but it still got caught in less than 12 hours, which seems fairly reasonable to me. It should ideally have been caught before merging the PR (either via reviews, and some blame is on me here, or via CI). If it had been an error log, it would have ended up in exactly the same way.

Member

I agree that doing error checking for all of these is cumbersome. This feels like something that we could statically analyze for. 🤔 I'd much prefer that over extra error checking boilerplate if it solves the same problem.

There are ~370 occurrences ...

This is the reason I got a bit nervous in this thread 😅 at least I had not been thinking about this failure condition. I think we've moved towards unconditionally registering all metrics on startup, which should help reduce the likelihood of hitting this, but metrics are inherently about conditionally incrementing some counters at runtime based on the logic of the program. So if we don't have static analyzers for the incrementing logic, then I think we need to be particularly vigilant that the test coverage catches this.

We did catch this relatively quickly, though it was not particularly reliable - it exhibited in various ways across different workflows; for instance, in some cases it was only the "features check" at the end of the run that failed, because it tried to query something in the operator. I'd guess fewer than 1% of individual test job runs hit it. There are two properties I'd like out of our metrics programming patterns if we can get them: one, to fail in a consistent way, and two, when it fails, to produce an understandable error condition. Any ideas we might have to improve these properties would be welcome.

Member

though it was not particularly reliable - it exhibited in various ways across different workflows; for instance, in some cases it was only the "features check" at the end of the run that failed, because it tried to query something in the operator.

Ah, yeah, I guess one aspect is that this metric was incremented when the reconciliation failed, so we actually needed a failure to trigger that. The other is that the check for operator restarts is currently disabled, as it proved flaky in CI (the operator can restart if it loses the leader election lease). We could probably make that logic a bit smarter to only ignore restarts caused by expected reasons.

Member

I see. I didn't see a related issue so I filed #40858 to follow up on that case in particular.

I ended up writing a linter, and it looks like we've got other code that dynamically constructs the set of labels passed to the metric counter functions, which further opens up the possibility that cilium-agent could randomly crash due to runtime state. I'll follow up separately regarding that.

Member

Related PR: #40863

@joestringer joestringer requested review from a team and derailed and removed request for a team July 29, 2025 22:35
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 29, 2025
Labels
feature/ces Impacts the Cilium Endpoint Slice logic. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.