cli: Improve fetching of Cilium component logs in failure scenarios #37160

Merged
merged 7 commits into main from pr/joe/cli-print-whats-wrong-maybe
Jan 31, 2025

Conversation

@joestringer (Member) commented Jan 22, 2025

Previously, the Cilium CLI would print the first 50 messages logged up
to 2 minutes before the last termination of the Cilium agent, in order
to provide some context about why a failure may have occurred.

This PR takes a slightly different strategy: treat the 5 oldest lines,
the 5 newest lines and any errors/warnings within that 2-minute window
as the most important, and make sure to print those. Omit all other
logs to minimize noise for the user (but print a hint that some logs
are missing, "<...>").
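
As a rough illustration of that strategy, here is a hand-written Go
sketch. It is not the actual cilium-cli implementation: the function
names, the level-matching heuristic and the head/tail constants are
all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical constants mirroring the "5 oldest, 5 newest" rule.
const keepHead, keepTail = 5, 5

// isImportant is a stand-in heuristic: always keep warning, error and
// fatal lines.
func isImportant(line string) bool {
	for _, lvl := range []string{"level=warning", "level=error", "level=fatal"} {
		if strings.Contains(line, lvl) {
			return true
		}
	}
	return false
}

// filterLogs keeps the keepHead oldest lines, the keepTail newest lines
// and any important lines in between; each omitted run of lines
// collapses into a single "<...>" marker.
func filterLogs(lines []string) []string {
	if len(lines) <= keepHead+keepTail {
		return lines
	}
	var out []string
	omitting := false
	for i, line := range lines {
		keep := i < keepHead || i >= len(lines)-keepTail || isImportant(line)
		if !keep {
			omitting = true
			continue
		}
		if omitting {
			out = append(out, "<...>")
			omitting = false
		}
		out = append(out, line)
	}
	return out
}

func main() {
	logs := []string{
		"level=info msg=one", "level=info msg=two", "level=info msg=three",
		"level=info msg=four", "level=info msg=five", "level=debug msg=noise",
		`level=error msg="something broke"`, "level=debug msg=noise",
		"level=info msg=a", "level=info msg=b", "level=info msg=c",
		"level=info msg=d", `level=fatal msg="agent exiting"`,
	}
	for _, l := range filterLogs(logs) {
		fmt.Println(l)
	}
}
```

With this input, the first five lines, the error line and the last five
lines (including the fatal) are printed, and each omitted run collapses
to a single "<...>", mirroring the hint in the output below.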

Furthermore, this PR extends the log-gathering capabilities to collect
logs not only from the cilium-agent when it encounters errors, but also
from other components such as cilium-operator, hubble-relay and
clustermesh-apiserver.

Some sample output when I simulated a bunch of errors in the cilium-agent,
operator and hubble-relay components:

$ ./cilium-cli/cilium status --wait-duration=1s --wait --interactive=false          
    /¯¯\
 /¯¯\__/¯¯\    Cilium:             4 errors
 \__/¯¯\__/    Operator:           2 errors
 /¯¯\__/¯¯\    Envoy DaemonSet:    OK
 \__/¯¯\__/    Hubble Relay:       2 errors
    \__/       ClusterMesh:        disabled

DaemonSet              cilium             Desired: 2, Unavailable: 2/2
DaemonSet              cilium-envoy       Desired: 2, Ready: 2/2, Available: 2/2
Deployment             cilium-operator    Desired: 1, Unavailable: 1/1
Deployment             hubble-relay       Desired: 1, Unavailable: 1/1
Containers:            cilium             Running: 2
                       cilium-envoy       Running: 2
                       cilium-operator    Running: 1
                       hubble-relay       Running: 1
Cluster Pods:          0/4 managed by Cilium
Helm chart version:    1.18.0-dev
Image versions         cilium             quay.io/cilium/cilium-ci:latest: 2
                       cilium-envoy       quay.io/cilium/cilium-envoy:v1.32.3-1737449866-585ba14700a2c729f024d9f0b2c694bc83a908a2@sha256:a3bbf5d8acc652f91e83c59890c61da5ee84c552be331fe9ad5389ad1a299562: 2
                       cilium-operator    quay.io/cilium/operator-generic-ci:latest: 1
                       hubble-relay       quay.io/cilium/hubble-relay-ci@sha256:063c07e2b465988fb25210f7ea94ed0dc1b2d1d529a4bd5a3f108a4e32634388: 1
Errors:                cilium             cilium          2 pods of DaemonSet cilium are not ready
                       cilium             cilium-dth9h    unable to retrieve cilium status: container cilium-agent is in CrashLoopBackOff, exited with code 1: ermesh.nodeManagerNotifier (pkg/clustermesh/notifier.go:38)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=6.86µs function="clustermesh.nodeManagerNotifier (pkg/clustermesh/notifier.go:38)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="stats.init.func3 (.../nat/stats/cell.go:61)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=551.756µs function="stats.init.func3 (.../nat/stats/cell.go:61)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="dynamicconfig.RegisterConfigMapReflector (pkg/dynamicconfig/table.go:90)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=487.35µs function="dynamicconfig.RegisterConfigMapReflector (pkg/dynamicconfig/table.go:90)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="dynamiclifecycle.registerWatcher (pkg/dynamiclifecycle/watcher.go:125)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=492.1µs function="dynamiclifecycle.registerWatcher (pkg/dynamiclifecycle/watcher.go:125)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="dynamiclifecycle.registerReconciler (pkg/dynamiclifecycle/reconciler.go:120)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=29.343µs function="dynamiclifecycle.registerReconciler (pkg/dynamiclifecycle/reconciler.go:120)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=459.917µs function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="features.updateAgentConfigMetricOnStart (.../metrics/features/features.go:14)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=5
    cilium    cilium-dth9h    container cilium-agent is in CrashLoopBackOff, pulling previous Pod logs for further investigation:
2025-01-24T00:43:02.601425337Z time="2025-01-24T00:43:02.601305767Z" level=debug msg="Skipped reading configuration file" error="Config File \"cilium\" Not Found in \"[/root]\"" subsys=config
2025-01-24T00:43:02.601606875Z time="2025-01-24T00:43:02.601571414Z" level=info msg="Memory available for map entries (0.250% of 33283883008B): 83209707B" subsys=config
2025-01-24T00:43:02.601613454Z time="2025-01-24T00:43:02.601579129Z" level=debug msg="Total memory for default map entries: 149422080" subsys=config
2025-01-24T00:43:02.601615701Z time="2025-01-24T00:43:02.601583651Z" level=info msg="option bpf-ct-global-tcp-max set by dynamic sizing to 291963" subsys=config
2025-01-24T00:43:02.601618004Z time="2025-01-24T00:43:02.601586582Z" level=info msg="option bpf-ct-global-any-max set by dynamic sizing to 145981" subsys=config
<...>
2025-01-24T00:43:02.827153296Z time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=459.917µs function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
2025-01-24T00:43:02.827163834Z time=2025-01-24T00:43:02Z level=debug msg=Invoking function="features.updateAgentConfigMetricOnStart (.../metrics/features/features.go:14)"
2025-01-24T00:43:02.827624091Z time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=502.504µs function="features.updateAgentConfigMetricOnStart (.../metrics/features/features.go:14)"
2025-01-24T00:43:02.827638509Z time=2025-01-24T00:43:02Z level=debug msg=Invoking function="cmd.init.func4 (cmd/cells.go:318)"
2025-01-24T00:43:02.828002710Z time="2025-01-24T00:43:02.827962574Z" level=fatal msg="joe is BREAKING cilium 😱" subsys=daemon
    cilium    cilium-k46l9    unable to retrieve cilium status: container cilium-agent is in CrashLoopBackOff, exited with code 1: esh.nodeManagerNotifier (pkg/clustermesh/notifier.go:38)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=9.986µs function="clustermesh.nodeManagerNotifier (pkg/clustermesh/notifier.go:38)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="stats.init.func3 (.../nat/stats/cell.go:61)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=560.968µs function="stats.init.func3 (.../nat/stats/cell.go:61)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="dynamicconfig.RegisterConfigMapReflector (pkg/dynamicconfig/table.go:90)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=508.107µs function="dynamicconfig.RegisterConfigMapReflector (pkg/dynamicconfig/table.go:90)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="dynamiclifecycle.registerWatcher (pkg/dynamiclifecycle/watcher.go:125)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=506.31µs function="dynamiclifecycle.registerWatcher (pkg/dynamiclifecycle/watcher.go:125)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="dynamiclifecycle.registerReconciler (pkg/dynamiclifecycle/reconciler.go:120)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=35.509µs function="dynamiclifecycle.registerReconciler (pkg/dynamiclifecycle/reconciler.go:120)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=477.289µs function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="features.updateAgentConfigMetricOnStart (.../metrics/features/features.go:14)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=5
    cilium-operator    cilium-operator                    1 pods of Deployment cilium-operator are not ready
    cilium-operator    cilium-operator-658fbf65c-bs78c    container cilium-operator is in CrashLoopBackOff, pulling previous Pod logs for further investigation:
2025-01-24T00:44:22.646874302Z Error: unknown flag: --bad-arg-yo
2025-01-24T00:44:22.647348762Z Usage:
2025-01-24T00:44:22.647367160Z   cilium-operator-generic [flags]
2025-01-24T00:44:22.647372538Z   cilium-operator-generic [command]
2025-01-24T00:44:22.647375700Z 
<...>
2025-01-24T00:44:22.647792729Z       --validate-network-policy                              Whether to enable or disable the informational network policy validator (default true)
2025-01-24T00:44:22.647795371Z       --version                                              Print version information
2025-01-24T00:44:22.647797851Z 
2025-01-24T00:44:22.647800608Z Use "cilium-operator-generic [command] --help" for more information about a command.
2025-01-24T00:44:22.647803181Z
    hubble-relay    hubble-relay                     1 pods of Deployment hubble-relay are not ready
    hubble-relay    hubble-relay-6cbf9894c4-lj5kc    container hubble-relay is in CrashLoopBackOff, pulling previous Pod logs for further investigation:
2025-01-24T00:45:25.063779760Z Error: failed to start gops agent: mkdir /.config: permission denied

Error: Unable to determine status:  wait canceled, cilium agent container has crashed or was terminated: context canceled

@joestringer added the release-note/minor label on Jan 22, 2025
@github-actions bot added the cilium-cli label on Jan 22, 2025
@joestringer changed the title from "cli: Improve UX when encountering errors in Cilium components" to "cli: Improve fetching of Cilium component logs in failure scenarios" on Jan 22, 2025
@joestringer force-pushed the pr/joe/cli-print-whats-wrong-maybe branch 4 times, most recently from 570f0d5 to da0c53a on January 24, 2025 01:15
@joestringer (Member, Author)

/test

@joestringer marked this pull request as ready for review January 24, 2025 02:15
@joestringer requested review from a team as code owners January 24, 2025 02:15
@joestringer force-pushed the pr/joe/cli-print-whats-wrong-maybe branch from da0c53a to aa0a5e0 on January 28, 2025 01:50
@joestringer requested a review from a team as a code owner January 28, 2025 01:50
@joestringer requested a review from brlbil January 28, 2025 01:50
@joestringer (Member, Author)

/test

@joestringer enabled auto-merge January 28, 2025 01:51
@joestringer (Member, Author)

@derailed gentle ping for review 🙏

@derailed (Contributor) left a comment

@joestringer Nice work Joe. That would help a lot!

@maintainer-s-little-helper bot added the ready-to-merge label on Jan 30, 2025
@joestringer force-pushed the pr/joe/cli-print-whats-wrong-maybe branch from aa0a5e0 to bcf4502 on January 30, 2025 16:56
@joestringer (Member, Author)

/test

Previously, cilium status would pull the proper Pod logs only if the
k8s container status didn't already include logs. In my own local
testing with k8s v1.29.x I found that the container status was
regularly unreliable, in that it would fail to include the last few
hundred bytes of logs from the agent in cases where the agent had
already logged a large amount (e.g. 50KiB+). This would hide critical
information from the end of the log.

Since we already limit ourselves to making only one Cilium Pod logs
request per CLI status invocation, let's just always fetch the proper
Pod logs first and only fall back to the container status output when
investigating the second and subsequent failing Pods.

Signed-off-by: Joe Stringer <joe@cilium.io>
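
A minimal client-go sketch of that approach, under an assumed helper
name (illustrative, not the CLI's actual code). The key detail is
`Previous: true`, which requests the logs of the terminated container
instance rather than relying on the possibly-truncated tail embedded
in the container status:

```go
package status

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// previousPodLogs is a hypothetical helper that fetches the full logs
// of the previously terminated container instance directly, instead of
// trusting the (possibly truncated) container-status message.
func previousPodLogs(ctx context.Context, client kubernetes.Interface,
	namespace, pod, container string) (string, error) {
	opts := &corev1.PodLogOptions{
		Container: container,
		Previous:  true, // logs from the last terminated instance
	}
	stream, err := client.CoreV1().Pods(namespace).GetLogs(pod, opts).Stream(ctx)
	if err != nil {
		return "", err
	}
	defer stream.Close()
	b, err := io.ReadAll(stream)
	return string(b), err
}
```
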
This change moves the logic to pull and process cilium-agent logs into
a dedicated function so it can be reused for cilium-operator and
hubble-relay logs. Additionally, rather than overriding the logs
gathered via container status, this splits the gathering into a
dedicated step. This provides more consistency when multiple pods fail
for the same reason.

Signed-off-by: Joe Stringer <joe@cilium.io>

Previously, the Cilium CLI would print the first 50 messages logged up
to 2 minutes before the last termination of the Cilium agent, in order
to provide some context about why a failure may have occurred.

This commit takes a slightly different strategy: treat the 5 oldest
lines, the 5 newest lines and any errors/warnings within that 2-minute
window as the most important, and make sure to print those. Omit all
other logs to minimize noise for the user (but print a hint that some
logs are missing, "<...>").

Signed-off-by: Joe Stringer <joe@cilium.io>

When the cilium-operator fails, for example with CrashLoopBackOff, pull
the previous logs for that component to share with the user.

Signed-off-by: Joe Stringer <joe@cilium.io>

When the hubble-relay fails, for example with CrashLoopBackOff, pull
the previous logs for that component to share with the user.

Signed-off-by: Joe Stringer <joe@cilium.io>

When the clustermesh-apiserver fails, for example with
CrashLoopBackOff, pull the previous logs for that component to share
with the user.

Signed-off-by: Joe Stringer <joe@cilium.io>
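
The shared detection step can be pictured with a short sketch (again
under assumed names, not the actual implementation): inspect each
Pod's container statuses for a CrashLoopBackOff waiting reason, then
pull the previous logs for every crashed container.

```go
package status

import corev1 "k8s.io/api/core/v1"

// crashedContainers is an illustrative helper that returns the names
// of containers in a Pod currently waiting in CrashLoopBackOff;
// callers would then fetch the previous logs for each of them.
func crashedContainers(pod *corev1.Pod) []string {
	var crashed []string
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			crashed = append(crashed, cs.Name)
		}
	}
	return crashed
}
```

Applied uniformly, the same check covers cilium-operator, hubble-relay
and clustermesh-apiserver pods alike.
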
This function fetches container logs for a target pod; make the code
clearer by renaming the function.

Signed-off-by: Joe Stringer <joe@cilium.io>

@joestringer force-pushed the pr/joe/cli-print-whats-wrong-maybe branch from bcf4502 to 9f2f9da on January 30, 2025 17:23
@joestringer (Member, Author)

/test

@joestringer added this pull request to the merge queue Jan 31, 2025
Merged via the queue into main with commit 448fe4f Jan 31, 2025
218 checks passed
@joestringer deleted the pr/joe/cli-print-whats-wrong-maybe branch January 31, 2025 10:41
Labels
cilium-cli: This PR contains changes related with cilium-cli
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/minor: This PR changes functionality that users may find relevant to operating Cilium.