Cilium EndpointSlices: improve metrics from the Operator CES controller #40418
Conversation
/test
/test
Thanks! One initial comment inline. Could you please also extend the commit messages with information on the changes, so that context is also available there in addition to the PR description?
(Edit: integration test failures look legitimate)
Doc changes: Just one nit (see below), looks good to me otherwise. Thanks!
/test
/test
Looks good from my side, thanks!
/test
/test (linting problem, so re-running the tests again...)
Thanks! A couple more minor comments inline.
operator/pkg/ciliumendpointslice/testutils/no_op_workqueue_metrics_provider.go
LGTM for @cilium/operator, one question re. code ownership inline.
This modularizes the client-go Workqueue MetricsProvider setup so that it can be injected into any Cell leveraging Workqueues for easy monitoring. The buckets for histograms are also updated with more relevant values. Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
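For context, a minimal sketch of the client-go hook this commit builds on: workqueue.MetricsProvider is an interface the queue reports into, and a provider can be supplied when constructing a queue. The no-op implementation below (similar in spirit to the test helper touched in this PR) keeps the sketch short; a real provider would back these with Prometheus metrics. Names such as noOpProvider and the queue name are illustrative, and the exact constructor may differ depending on the vendored client-go version.

```go
package main

import (
	"k8s.io/client-go/util/workqueue"
)

// noOpMetric satisfies all of the metric interfaces the workqueue reports into.
type noOpMetric struct{}

func (noOpMetric) Inc()            {}
func (noOpMetric) Dec()            {}
func (noOpMetric) Set(float64)     {}
func (noOpMetric) Observe(float64) {}

// noOpProvider implements workqueue.MetricsProvider; a real provider would
// return Prometheus-backed implementations instead of no-ops.
type noOpProvider struct{}

func (noOpProvider) NewDepthMetric(string) workqueue.GaugeMetric            { return noOpMetric{} }
func (noOpProvider) NewAddsMetric(string) workqueue.CounterMetric           { return noOpMetric{} }
func (noOpProvider) NewLatencyMetric(string) workqueue.HistogramMetric      { return noOpMetric{} }
func (noOpProvider) NewWorkDurationMetric(string) workqueue.HistogramMetric { return noOpMetric{} }
func (noOpProvider) NewUnfinishedWorkSecondsMetric(string) workqueue.SettableGaugeMetric {
	return noOpMetric{}
}
func (noOpProvider) NewLongestRunningProcessorSecondsMetric(string) workqueue.SettableGaugeMetric {
	return noOpMetric{}
}
func (noOpProvider) NewRetriesMetric(string) workqueue.CounterMetric { return noOpMetric{} }

func main() {
	// The provider is handed to the queue at construction time; client-go then
	// reports depth, adds, latency, work duration, unfinished work, longest
	// running processor and retries through it.
	q := workqueue.NewRateLimitingQueueWithConfig(
		workqueue.DefaultControllerRateLimiter(),
		workqueue.RateLimitingQueueConfig{
			Name:            "example-ces-queue", // illustrative queue name
			MetricsProvider: noOpProvider{},
		},
	)
	defer q.ShutDown()
	q.Add("example-key")
}
```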
…queues Leverage the WorkqueueMetricsProvider in order to report useful metrics for queues inside the CES controller (depth, adds, latency, work duration, unfinished work, longest running processor, retries). Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
This metric used to use the default buckets {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}, which don't make sense here since they are far too small. Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
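For comparison, a minimal sketch of declaring a Prometheus histogram with explicit buckets instead of the defaults (prometheus.DefBuckets), which target sub-second request latencies rather than counts of CEP changes. The metric name and bucket values here are illustrative, not the ones used in the PR.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// cepChangesPerCES uses explicit buckets suited to counting CEP changes per
// CES update, rather than prometheus.DefBuckets. Name and bucket values are
// illustrative only.
var cepChangesPerCES = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_number_of_cep_changes_per_ces",
	Help:    "Number of CEP changes batched into a single CES update.",
	Buckets: []float64{1, 2, 4, 8, 16, 32, 64, 128, 256},
})

func main() {
	prometheus.MustRegister(cepChangesPerCES)
	cepChangesPerCES.Observe(3) // record a batch of 3 CEP changes
}
```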
/test
Sorry, was on vacation, now resuming work on this PR. I hopefully addressed the last comments.
When failures were reported by this metric, it was impossible to tell whether they were transient failures which will be retried or fatal errors (i.e. all retries exhausted). I also added a warning log for errors which are retried so that we can at least see them. Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
/test
		c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelOutcome, LabelValueOutcomeFail, LabelFailureType, LabelFailureTypeTransient).Inc()
	} else {
		c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelOutcome, LabelValueOutcomeFail, LabelFailureType, LabelFailureTypeFatal).Inc()
	}
} else {
	c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelValueOutcomeSuccess).Inc()
	c.metrics.CiliumEndpointSliceSyncTotal.WithLabelValues(LabelValueOutcomeSuccess, "").Inc()
Should the same metric be passed a different number of parameters?
Two other things to think about:
- How can we exercise such metrics to ensure that the codepaths are being hit in our test workflows?
- Apparently these calls can cause the entire binary to panic, which seems highly risky to me. I wonder if we could use alternate APIs that provide stronger guarantees around this (see the sketch below). It's making me a bit concerned in general about how we use metrics and whether we're mitigating the potential risks associated with panicking when metrics are incremented.
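For illustration only (this is not a change proposed in the PR): client_golang already ships a non-panicking variant of this API. CounterVec.GetMetricWithLabelValues returns an error on a label mismatch instead of panicking like WithLabelValues does. The metric name and labels below are made up for the sketch.

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// syncTotal declares two labels, so every increment must supply exactly two values.
var syncTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_sync_total",
		Help: "Example counter with an outcome and a failure_type label.",
	},
	[]string{"outcome", "failure_type"},
)

func incrementOutcome(outcome, failureType string) {
	// WithLabelValues(outcome) would panic here because only one value is
	// supplied for two labels; GetMetricWithLabelValues surfaces the mismatch
	// as an error instead.
	m, err := syncTotal.GetMetricWithLabelValues(outcome, failureType)
	if err != nil {
		log.Printf("failed to resolve metric labels: %v", err)
		return
	}
	m.Inc()
}

func main() {
	prometheus.MustRegister(syncTotal)
	incrementOutcome("success", "")
}
```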
cc @cilium/metrics for awareness and consideration of the above topics.
Should the same metric be passed a different number of parameters?
Good catch, this probably happened during one of the intermediate iterations, when switching back to using WithLabelValues. @antonipp Could you please take care of a follow-up fix?
Apparently these calls can cause the entire binary to panic, which seems highly risky to me.
Not a metrics person, but to me panicking seems fine, because it is basically a bug, and the earlier we catch it during the development cycle, the better (I'm kind of surprised that we haven't already seen it in CI 😕).
Here is the fix: #40817
panicking the cilium-agent due to a bug is not OK.
IMO it is appropriate to distinguish the type of bug. I totally agree with you that the agent should never panic because of e.g. missing validation, not checking error conditions, and things like that. But I personally think it still makes sense in these cases: it seems unrealistic to me to wrap all of these calls with if err != nil { log(err) }, just to end up catching them in CI in the very same way (but with much more verbose and hard-to-read code). To me, it doesn't look much different from MustParseIP("1.2.3.4") (with the string hardcoded obviously, not user-provided), where a typo could obviously slip through, but the likelihood is extremely low (arguably lower, but not totally impossible).
There's generally one place we accept panic, which is as part of config verification on startup.
There are ~370 occurrences of WithLabelValues (g grep WithLabelValues -- ':!vendor' ':!*_test.go' | wc -l), so there are at least two 😁. And a non-negligible number in vendored packages as well, such as controller-runtime. Joking aside, realistically there will always be panics, otherwise we'd have bug-free software (unless we wrap all goroutines with recover, but then we'd have other problems). And I personally struggle to see how checking an error that must not occur would help much in avoiding that (just as you can make a mistake writing the labels, you can forget to check the error).
We did see this in CI, but it was not reliable - most likely the code paths here are only sometimes hit.
Yes, I agree it was not fully reliable, but it still got caught in less than 12 hours, which seems fairly reasonable to me. It should ideally have been caught before merging the PR (either via reviews, and some blame is on me here, or via CI). If it were an error log, it would have ended up in exactly the same way.
I agree that doing error checking for all of these is cumbersome. This feels like something that we could statically analyze for. 🤔 I'd much prefer that over extra error checking boilerplate if it solves the same problem.
There are ~370 occurrences ...
This is the reason I got a bit nervous in this thread 😅; at least I had not been thinking about this failure condition. I think we've moved towards unconditionally registering all metrics on startup, which should help reduce the likelihood of hitting this, but metrics are inherently about conditionally incrementing some counters at runtime based on the logic of the program. So if we don't have static analyzers for the incrementing logic, then I think we need to be particularly vigilant that the test coverage catches this.
We did catch this relatively quickly, though it was not particularly reliable: it manifested in different ways across workflows; for instance, in some cases only the "features check" at the end of the run failed, because it tried to query something in the operator. I'd guess fewer than 1% of individual test job runs hit it. There are two properties I'd like out of our metrics programming patterns if we could: one, to fail in a consistent way, and two, to produce an understandable error condition when it fails. Any ideas we might have to improve these properties would be welcome.
though it was not particularly reliable: it manifested in different ways across workflows; for instance, in some cases only the "features check" at the end of the run failed, because it tried to query something in the operator.
Ah, yeah, I guess one aspect is that this metric was only incremented when reconciliation failed, so we actually needed a failure to trigger it. The other is that the check for operator restarts is currently disabled, as it proved to be flaky in CI (the operator can restart if it loses the leader election lease). We could probably make that logic a bit smarter and only ignore restarts caused by expected reasons.
I see. I didn't see a related issue so I filed #40858 to follow up on that case in particular.
I ended up writing a linter, and it looks like we've got other code that dynamically constructs the set of labels passed to the metrics counter functions, which further opens up the possibility that cilium-agent could randomly crash due to runtime state. I'll follow up separately regarding that.
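Purely as an illustration of what such a static check could look like (this is not the linter referenced above), a minimal go/analysis pass might flag WithLabelValues arguments that are not plain string literals. A real linter would also need to resolve named constants and compare the argument count against the metric's declared labels.

```go
package main

import (
	"go/ast"
	"go/token"

	"golang.org/x/tools/go/analysis"
	"golang.org/x/tools/go/analysis/passes/inspect"
	"golang.org/x/tools/go/analysis/singlechecker"
	"golang.org/x/tools/go/ast/inspector"
)

// Analyzer flags WithLabelValues arguments that are not plain string literals.
// It is deliberately naive: it does not resolve named constants and does not
// check the argument count against the metric's declared labels.
var Analyzer = &analysis.Analyzer{
	Name:     "metriclabels",
	Doc:      "flags non-literal label values passed to WithLabelValues",
	Requires: []*analysis.Analyzer{inspect.Analyzer},
	Run:      run,
}

func run(pass *analysis.Pass) (interface{}, error) {
	ins := pass.ResultOf[inspect.Analyzer].(*inspector.Inspector)
	ins.Preorder([]ast.Node{(*ast.CallExpr)(nil)}, func(n ast.Node) {
		call := n.(*ast.CallExpr)
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "WithLabelValues" {
			return
		}
		for _, arg := range call.Args {
			// String literals are safe; anything built at runtime could panic
			// the binary if it does not match the declared labels.
			if lit, ok := arg.(*ast.BasicLit); ok && lit.Kind == token.STRING {
				continue
			}
			pass.Reportf(arg.Pos(), "non-literal label value passed to WithLabelValues")
		}
	})
	return nil, nil
}

func main() {
	singlechecker.Main(Analyzer)
}
```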
Related PR: #40863
Description
This PR brings multiple improvements to the CES controller monitoring in the Operator.
Summary of changes:
- client-go's workqueue#MetricsProvider setup was extracted into a separate Cell. This modularizes the MetricsProvider setup so that it can be injected into any Operator Cell leveraging Workqueues for easy monitoring. I did not touch the Agent code since it's a much heavier lift and can potentially be done separately later on. I also updated the Histogram metrics emitted by the Provider so that their buckets have more relevant values.
- The MetricsProvider is attached to both WorkQueues used by the CES Controller to export a variety of metrics to Prometheus (depth, adds, latency, work duration, unfinished work, longest running processor, retries).
- The buckets of the number_of_cep_changes_per_ces metric were updated: it used to use the default buckets {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}, which really don't make sense here.
- The ces_sync_total metric now distinguishes failure types: previously, when failures were reported by this metric, it was impossible to tell whether they were transient failures which will be retried or fatal errors (i.e. all retries exhausted). I also added a warning log for errors which are retried so that we can at least see them (see the sketch after this list).
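For readers unfamiliar with the pattern, here is a minimal sketch of the transient-vs-fatal classification described in the last bullet. All names (reconciler, maxRetries, the metric and label values) are illustrative assumptions and do not reflect the controller's actual code.

```go
package main

import (
	"errors"
	"log/slog"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

// maxRetries is an illustrative retry budget, not the controller's actual value.
const maxRetries = 15

type reconciler struct {
	queue     workqueue.RateLimitingInterface
	syncTotal *prometheus.CounterVec // labels: outcome, failure_type
	logger    *slog.Logger
}

// handleErr classifies a sync failure as transient (still within the retry
// budget, will be requeued) or fatal (retries exhausted) before incrementing
// the counter, and logs transient failures so they remain visible.
func (r *reconciler) handleErr(err error, key string) {
	if err == nil {
		r.queue.Forget(key)
		r.syncTotal.WithLabelValues("success", "").Inc()
		return
	}
	if r.queue.NumRequeues(key) < maxRetries {
		r.logger.Warn("failed to sync CES, will retry", "key", key, "error", err)
		r.syncTotal.WithLabelValues("failure", "transient").Inc()
		r.queue.AddRateLimited(key)
		return
	}
	r.syncTotal.WithLabelValues("failure", "fatal").Inc()
	r.queue.Forget(key)
}

func main() {
	r := &reconciler{
		queue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
		syncTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{Name: "example_ces_sync_total", Help: "Example CES sync outcomes."},
			[]string{"outcome", "failure_type"},
		),
		logger: slog.Default(),
	}
	r.handleErr(errors.New("transient example"), "example-ces")
}
```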