-
Notifications
You must be signed in to change notification settings - Fork 3.4k
hubble: Metrics server cell #39549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hubble: Metrics server cell #39549
Conversation
2f2ddce
to
595e790
Compare
d992ddb
to
dc4de89
Compare
dc4de89
to
4c98aa3
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Overall this looks like a huge improvement!
Pull request was converted to draft
4c98aa3
to
e7e21ca
Compare
ca9684a
to
5486780
Compare
1706c37
to
4e3e79e
Compare
Add a new constructor that can be used in a cell provide call that provides a Promise over a WatchedServerConfig. The promise resolves when the certificate files become available. When disabled, the promise will be nil, allowing to detect whether certloader was enabled or not. On hive stop, the cell will take care of stopping the config watcher if it exists. Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
This new status object will be used by the incoming hubble-metrics cell to report its status independently from the Hubble cell. Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Abstract the OnDecodedFlow hook that we register when using static metrics in a StaticFlowProcessor type to align with the existing DynamicFlowProcessor. Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Abstract the static and dynamic flow processors behind an interface. This will allow providing the interface to the Hubble cell and have hubble register the OnDecodedFlow hooks. Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
4e3e79e
to
05cc94b
Compare
Create a new Hive cell that provides the hubble metrics server and the metrics flow processor which is meant to be hooked in the hubble observer to perform metrics recording. The metrics server also implements a new Server interface that will be used to report the server state from cilium-status. Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Replace the current initialization logic for the metrics server and flow processor with the new hubble-metrics cell. Update the hubble status reporting to now use the new metrics server status. Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Upgrade the status reporting for the metrics server used in cilium-status and include warnings and errors observed when starting/running the server. More specifically, record and report a warning status when the server is waiting for the TLS certificates to become ready, report an error if we failed to wait for certificates and report an error if we failed serving http metrics. Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Replace the existing logic to setup and wait for certificates to become available with a new certloader cell. It re-uses the new hive constructor: certloader.NewWatchedServerConfigPromise and provides a promise that the Hubble cell can await in its launch function. Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
In its current form, FutureWatcher spawns a goroutine and sends the newly created watcher on a channel when the watched files become ready. Additionally, the created watcher also starts goroutines for its file watching mechanism. If files never become ready and we decide to abandon the watch, there is no way to abort the spawned goroutines, therefore leaking them. To mitigate, make FutureWatcher take a context and abort both the outer goroutine and the created watcher+underlying goroutines on cancel. Update all certloader tests to detect and fail if any leaked goroutines are detected using goleak. Add a new test that ensures the hive is cleanly shutdown and does not leak goroutines when watched cert files never become ready. Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
05cc94b
to
2c5f1e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR based on feedback regarding usage of hive lifecycle hooks vs one-shot jobs for managing long-lived processes.
Bonus, we now get entries for each processes in the health table:
├── hubble
│ ├── hubble-metrics
│ │ ├── job-certloader-tls [OK] Running (8m33s, x1)
│ │ └── job-hubble-metrics-server [OK] Running (8m33s, x1)
│ ├── job-certloader-tls [STOPPED] Running (8m33s, x1)
│ └── job-hubble [STOPPED] Running (8m33s, x1)
/test |
/ci-ingress |
/ci-ingress |
Description
This PR extracts the initialization logic for the Hubble metrics server and the flow processor into its own Hive cell.
Since this work required abstracting the mechanism for retrieving certificates for the Hubble metrics server using
certloader.WatchedServerConfig
, I also re-used it for the Hubble server certificates.Two notable behaviour changes:
Additionally, the default value for the
--hubble-disable-tls
flag is nowtrue
. The value in helm options is unchanged. This is to prevent errors in hive tests returned by the certloader dependency requiring a filepath set when enabled at hive startup time. Before we did not face issues because the TLS config watcher was always setup in the launch method of Hubble, after the cell has been started. The helm chart always sets the value explicitly, therefore there is no behaviour change caused by this.See individual commits for details.
Related: #34501
cc @kaworu
Richer cilium status reporting
We now record and report warnings and errors when starting/running the metrics server.
Previously, we would always report OK even if the server crashed at some point. Additionally, now that Hubble no longer blocks at initialization waiting for the metrics server TLS certificates to become ready, we can report the state of Hubble and the metrics server independently.
Example:
We could think of expending this in the future to report more information, such as how many metrics handlers are enabled, and even which ones.
Additionally, by using one-shot jobs to register long-lived processes such as the metrics server and certloader tls watchers, we get entries in the health table:
Certloader cell provider
While working on this, I noticed that the process of creating a FutureWatchedServerConfig and waiting for TLS certificates to become ready was duplicated for both the Hubble metrics server and the observer server. I took this opportunity to refactor that logic into a new
certloader.NewWatchedServerConfigPromise
constructor that can be used withcell.Provide
andcell.ProvidePrivate
.At first, I considered adding a new "reusable" cell, similar to how the pprof cell has been refactored lately and allows providing custom flags for it. The difficulty here is in avoiding collisions/replacement with the provided type, here a promise over a WatchedServerConfig, when the cell is being used multiple times in the same hive. By only providing the constructor, its possible for a cell author to wrap the return type and provide that as a dependency. This is how we are doing it here with both the Hubble metrics server and the Hubble server cells.