Skip to content

Conversation

devodev
Copy link
Contributor

@devodev devodev commented May 14, 2025

Description

This PR extracts the initialization logic for the Hubble metrics server and the flow processor into its own Hive cell.
Since this work required abstracting the mechanism for retrieving certificates for the Hubble metrics server using certloader.WatchedServerConfig, I also re-used it for the Hubble server certificates.

Two notable behaviour changes:

  • The metrics server no longer blocks Hubble from starting while waiting for TLS certificates to become available.
  • The metrics server now reports warnings/errors in cilium status.
  • The metrics server and certloader tls cert watcher now reports their job status in the health table.

Additionally, the default value for the --hubble-disable-tls flag is now true. The value in helm options is unchanged. This is to prevent errors in hive tests returned by the certloader dependency requiring a filepath set when enabled at hive startup time. Before we did not face issues because the TLS config watcher was always setup in the launch method of Hubble, after the cell has been started. The helm chart always sets the value explicitly, therefore there is no behaviour change caused by this.

See individual commits for details.

Related: #34501
cc @kaworu

Extract the Hubble metrics server into its own Hive cell
Include more details about the Hubble metrics server state in cilium-status

Richer cilium status reporting

We now record and report warnings and errors when starting/running the metrics server.
Previously, we would always report OK even if the server crashed at some point. Additionally, now that Hubble no longer blocks at initialization waiting for the metrics server TLS certificates to become ready, we can report the state of Hubble and the metrics server independently.

Example:

$ cilium status
...
Hubble:                  Ok              Current/Max Flows: 4095/4095 (100.00%), Flows/s: 31.79   Metrics: Warning (Waiting for TLS certificates to become available)

We could think of expending this in the future to report more information, such as how many metrics handlers are enabled, and even which ones.

Additionally, by using one-shot jobs to register long-lived processes such as the metrics server and certloader tls watchers, we get entries in the health table:

$ cilium status --all-health
...
├── hubble
│   ├── hubble-metrics
│   │   ├── job-certloader-tls          [OK] Running (8m33s, x1)
│   │   └── job-hubble-metrics-server   [OK] Running (8m33s, x1)
│   ├── job-certloader-tls              [STOPPED] Running (8m33s, x1)
│   └── job-hubble                      [STOPPED] Running (8m33s, x1)

Certloader cell provider

While working on this, I noticed that the process of creating a FutureWatchedServerConfig and waiting for TLS certificates to become ready was duplicated for both the Hubble metrics server and the observer server. I took this opportunity to refactor that logic into a new certloader.NewWatchedServerConfigPromise constructor that can be used with cell.Provide and cell.ProvidePrivate.

At first, I considered adding a new "reusable" cell, similar to how the pprof cell has been refactored lately and allows providing custom flags for it. The difficulty here is in avoiding collisions/replacement with the provided type, here a promise over a WatchedServerConfig, when the cell is being used multiple times in the same hive. By only providing the constructor, its possible for a cell author to wrap the return type and provide that as a dependency. This is how we are doing it here with both the Hubble metrics server and the Hubble server cells.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 14, 2025
@devodev devodev force-pushed the pr/devodev/hubble-metrics-cell branch from 2f2ddce to 595e790 Compare May 14, 2025 22:26
@devodev devodev force-pushed the pr/devodev/hubble-metrics-cell branch 6 times, most recently from d992ddb to dc4de89 Compare May 22, 2025 19:46
@devodev devodev force-pushed the pr/devodev/hubble-metrics-cell branch from dc4de89 to 4c98aa3 Compare May 22, 2025 20:44
@devodev
Copy link
Contributor Author

devodev commented May 22, 2025

/test

@devodev devodev marked this pull request as ready for review May 23, 2025 17:45
@devodev devodev requested review from a team as code owners May 23, 2025 17:45
@devodev devodev requested review from thorn3r, glrf and aanm May 23, 2025 17:46
@aanm aanm enabled auto-merge May 26, 2025 17:48
@aanm aanm added the release-note/misc This PR makes changes that have no direct user impact. label May 26, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 26, 2025
@aanm aanm added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. area/hubble Impacts hubble server or relay labels May 26, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 26, 2025
Copy link
Contributor

@glrf glrf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Overall this looks like a huge improvement!

@devodev devodev marked this pull request as draft May 27, 2025 16:05
auto-merge was automatically disabled May 27, 2025 16:05

Pull request was converted to draft

@devodev devodev force-pushed the pr/devodev/hubble-metrics-cell branch from 4c98aa3 to e7e21ca Compare May 27, 2025 16:54
@devodev devodev force-pushed the pr/devodev/hubble-metrics-cell branch from ca9684a to 5486780 Compare June 2, 2025 19:29
@devodev devodev force-pushed the pr/devodev/hubble-metrics-cell branch 2 times, most recently from 1706c37 to 4e3e79e Compare June 3, 2025 14:18
devodev added 4 commits June 3, 2025 11:37
Add a new constructor that can be used in a cell provide call that
provides a Promise over a WatchedServerConfig. The promise resolves
when the certificate files become available. When disabled, the
promise will be nil, allowing to detect whether certloader was enabled
or not. On hive stop, the cell will take care of stopping the config
watcher if it exists.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
This new status object will be used by the incoming hubble-metrics
cell to report its status independently from the Hubble cell.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Abstract the OnDecodedFlow hook that we register when using
static metrics in a StaticFlowProcessor type to align with
the existing DynamicFlowProcessor.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Abstract the static and dynamic flow processors behind an interface.
This will allow providing the interface to the Hubble cell and have
hubble register the OnDecodedFlow hooks.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
@devodev devodev force-pushed the pr/devodev/hubble-metrics-cell branch from 4e3e79e to 05cc94b Compare June 3, 2025 16:01
devodev added 5 commits June 3, 2025 12:03
Create a new Hive cell that provides the hubble metrics server
and the metrics flow processor which is meant to be hooked in
the hubble observer to perform metrics recording.

The metrics server also implements a new Server interface that
will be used to report the server state from cilium-status.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Replace the current initialization logic for the metrics server
and flow processor with the new hubble-metrics cell.

Update the hubble status reporting to now use the new metrics server
status.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Upgrade the status reporting for the metrics server used in cilium-status
and include warnings and errors observed when starting/running the server.

More specifically, record and report a warning status when the server is waiting
for the TLS certificates to become ready, report an error if we failed to wait for
certificates and report an error if we failed serving http metrics.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
Replace the existing logic to setup and wait for certificates to become
available with a new certloader cell. It re-uses the new hive constructor:
certloader.NewWatchedServerConfigPromise and provides a promise
that the Hubble cell can await in its launch function.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
In its current form, FutureWatcher spawns a goroutine and sends
the newly created watcher on a channel when the watched files
become ready. Additionally, the created watcher also starts goroutines
for its file watching mechanism. If files never become ready and we
decide to abandon the watch, there is no way to abort the spawned
goroutines, therefore leaking them.

To mitigate, make FutureWatcher take a context and abort both the
outer goroutine and the created watcher+underlying goroutines on
cancel.

Update all certloader tests to detect and fail if any leaked goroutines
are detected using goleak.

Add a new test that ensures the hive is cleanly shutdown and does
not leak goroutines when watched cert files never become ready.

Signed-off-by: Alexandre Barone <abalexandrebarone@gmail.com>
@devodev devodev force-pushed the pr/devodev/hubble-metrics-cell branch from 05cc94b to 2c5f1e1 Compare June 3, 2025 16:04
Copy link
Contributor Author

@devodev devodev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR based on feedback regarding usage of hive lifecycle hooks vs one-shot jobs for managing long-lived processes.

Bonus, we now get entries for each processes in the health table:

├── hubble
│   ├── hubble-metrics
│   │   ├── job-certloader-tls          [OK] Running (8m33s, x1)
│   │   └── job-hubble-metrics-server   [OK] Running (8m33s, x1)
│   ├── job-certloader-tls              [STOPPED] Running (8m33s, x1)
│   └── job-hubble                      [STOPPED] Running (8m33s, x1)

@devodev
Copy link
Contributor Author

devodev commented Jun 3, 2025

/test

@devodev
Copy link
Contributor Author

devodev commented Jun 3, 2025

/ci-ingress

@devodev
Copy link
Contributor Author

devodev commented Jun 3, 2025

/ci-ingress

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 3, 2025
@chancez chancez added this pull request to the merge queue Jun 3, 2025
Merged via the queue into cilium:main with commit 6b264c8 Jun 3, 2025
68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/hubble Impacts hubble server or relay ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants