cli: Improve fetching of Cilium component logs in failure scenarios #37160

Merged
merged 7 commits into main from pr/joe/cli-print-whats-wrong-maybe
Jan 31, 2025

Conversation

@joestringer (Member) commented Jan 22, 2025

Previously, the Cilium CLI would print the first 50 messages logged up
to 2 minutes before the last termination of the Cilium agent, in order
to provide some context about why a failure may have occurred.

This PR takes a slightly different strategy: treat the 5 oldest lines,
the 5 newest lines and any errors/warnings within that 2-minute window
as the most important, and make sure to print those. Omit all other
logs to minimize noise for the user (but print a hint that some logs
are missing, "<...>").
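
As a rough illustration of that strategy, here is a hand-written Go
sketch. It is not the actual cilium-cli implementation: the function
names, the level-matching heuristic and the head/tail constants are
all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical constants mirroring the "5 oldest, 5 newest" rule.
const keepHead, keepTail = 5, 5

// isImportant is a stand-in heuristic: always keep warning, error and
// fatal lines.
func isImportant(line string) bool {
	for _, lvl := range []string{"level=warning", "level=error", "level=fatal"} {
		if strings.Contains(line, lvl) {
			return true
		}
	}
	return false
}

// filterLogs keeps the keepHead oldest lines, the keepTail newest lines
// and any important lines in between; each omitted run of lines
// collapses into a single "<...>" marker.
func filterLogs(lines []string) []string {
	if len(lines) <= keepHead+keepTail {
		return lines
	}
	var out []string
	omitting := false
	for i, line := range lines {
		keep := i < keepHead || i >= len(lines)-keepTail || isImportant(line)
		if !keep {
			omitting = true
			continue
		}
		if omitting {
			out = append(out, "<...>")
			omitting = false
		}
		out = append(out, line)
	}
	return out
}

func main() {
	logs := []string{
		"level=info msg=one", "level=info msg=two", "level=info msg=three",
		"level=info msg=four", "level=info msg=five", "level=debug msg=noise",
		`level=error msg="something broke"`, "level=debug msg=noise",
		"level=info msg=a", "level=info msg=b", "level=info msg=c",
		"level=info msg=d", `level=fatal msg="agent exiting"`,
	}
	for _, l := range filterLogs(logs) {
		fmt.Println(l)
	}
}
```

With this input, the first five lines, the error line and the last five
lines (including the fatal) are printed, and each omitted run collapses
to a single "<...>", mirroring the hint in the output below.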

Furthermore, this PR extends the log-gathering capabilities to collect
logs not only from the cilium-agent when it encounters errors, but also
from other components such as cilium-operator, hubble-relay and
clustermesh-apiserver.

Some sample output when I simulated a bunch of errors in the cilium-agent,
operator and hubble-relay components:

$ ./cilium-cli/cilium status --wait-duration=1s --wait --interactive=false          
    /¯¯\
 /¯¯\__/¯¯\    Cilium:             4 errors
 \__/¯¯\__/    Operator:           2 errors
 /¯¯\__/¯¯\    Envoy DaemonSet:    OK
 \__/¯¯\__/    Hubble Relay:       2 errors
    \__/       ClusterMesh:        disabled

DaemonSet              cilium             Desired: 2, Unavailable: 2/2
DaemonSet              cilium-envoy       Desired: 2, Ready: 2/2, Available: 2/2
Deployment             cilium-operator    Desired: 1, Unavailable: 1/1
Deployment             hubble-relay       Desired: 1, Unavailable: 1/1
Containers:            cilium             Running: 2
                       cilium-envoy       Running: 2
                       cilium-operator    Running: 1
                       hubble-relay       Running: 1
Cluster Pods:          0/4 managed by Cilium
Helm chart version:    1.18.0-dev
Image versions         cilium             quay.io/cilium/cilium-ci:latest: 2
                       cilium-envoy       quay.io/cilium/cilium-envoy:v1.32.3-1737449866-585ba14700a2c729f024d9f0b2c694bc83a908a2@sha256:a3bbf5d8acc652f91e83c59890c61da5ee84c552be331fe9ad5389ad1a299562: 2
                       cilium-operator    quay.io/cilium/operator-generic-ci:latest: 1
                       hubble-relay       quay.io/cilium/hubble-relay-ci@sha256:063c07e2b465988fb25210f7ea94ed0dc1b2d1d529a4bd5a3f108a4e32634388: 1
Errors:                cilium             cilium          2 pods of DaemonSet cilium are not ready
                       cilium             cilium-dth9h    unable to retrieve cilium status: container cilium-agent is in CrashLoopBackOff, exited with code 1: ermesh.nodeManagerNotifier (pkg/clustermesh/notifier.go:38)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=6.86µs function="clustermesh.nodeManagerNotifier (pkg/clustermesh/notifier.go:38)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="stats.init.func3 (.../nat/stats/cell.go:61)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=551.756µs function="stats.init.func3 (.../nat/stats/cell.go:61)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="dynamicconfig.RegisterConfigMapReflector (pkg/dynamicconfig/table.go:90)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=487.35µs function="dynamicconfig.RegisterConfigMapReflector (pkg/dynamicconfig/table.go:90)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="dynamiclifecycle.registerWatcher (pkg/dynamiclifecycle/watcher.go:125)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=492.1µs function="dynamiclifecycle.registerWatcher (pkg/dynamiclifecycle/watcher.go:125)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="dynamiclifecycle.registerReconciler (pkg/dynamiclifecycle/reconciler.go:120)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=29.343µs function="dynamiclifecycle.registerReconciler (pkg/dynamiclifecycle/reconciler.go:120)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=459.917µs function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
time=2025-01-24T00:43:02Z level=debug msg=Invoking function="features.updateAgentConfigMetricOnStart (.../metrics/features/features.go:14)"
time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=5
    cilium    cilium-dth9h    container cilium-agent is in CrashLoopBackOff, pulling previous Pod logs for further investigation:
2025-01-24T00:43:02.601425337Z time="2025-01-24T00:43:02.601305767Z" level=debug msg="Skipped reading configuration file" error="Config File \"cilium\" Not Found in \"[/root]\"" subsys=config
2025-01-24T00:43:02.601606875Z time="2025-01-24T00:43:02.601571414Z" level=info msg="Memory available for map entries (0.250% of 33283883008B): 83209707B" subsys=config
2025-01-24T00:43:02.601613454Z time="2025-01-24T00:43:02.601579129Z" level=debug msg="Total memory for default map entries: 149422080" subsys=config
2025-01-24T00:43:02.601615701Z time="2025-01-24T00:43:02.601583651Z" level=info msg="option bpf-ct-global-tcp-max set by dynamic sizing to 291963" subsys=config
2025-01-24T00:43:02.601618004Z time="2025-01-24T00:43:02.601586582Z" level=info msg="option bpf-ct-global-any-max set by dynamic sizing to 145981" subsys=config
<...>
2025-01-24T00:43:02.827153296Z time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=459.917µs function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
2025-01-24T00:43:02.827163834Z time=2025-01-24T00:43:02Z level=debug msg=Invoking function="features.updateAgentConfigMetricOnStart (.../metrics/features/features.go:14)"
2025-01-24T00:43:02.827624091Z time=2025-01-24T00:43:02Z level=debug msg=Invoked duration=502.504µs function="features.updateAgentConfigMetricOnStart (.../metrics/features/features.go:14)"
2025-01-24T00:43:02.827638509Z time=2025-01-24T00:43:02Z level=debug msg=Invoking function="cmd.init.func4 (cmd/cells.go:318)"
2025-01-24T00:43:02.828002710Z time="2025-01-24T00:43:02.827962574Z" level=fatal msg="joe is BREAKING cilium 😱" subsys=daemon
    cilium    cilium-k46l9    unable to retrieve cilium status: container cilium-agent is in CrashLoopBackOff, exited with code 1: esh.nodeManagerNotifier (pkg/clustermesh/notifier.go:38)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=9.986µs function="clustermesh.nodeManagerNotifier (pkg/clustermesh/notifier.go:38)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="stats.init.func3 (.../nat/stats/cell.go:61)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=560.968µs function="stats.init.func3 (.../nat/stats/cell.go:61)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="dynamicconfig.RegisterConfigMapReflector (pkg/dynamicconfig/table.go:90)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=508.107µs function="dynamicconfig.RegisterConfigMapReflector (pkg/dynamicconfig/table.go:90)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="dynamiclifecycle.registerWatcher (pkg/dynamiclifecycle/watcher.go:125)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=506.31µs function="dynamiclifecycle.registerWatcher (pkg/dynamiclifecycle/watcher.go:125)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="dynamiclifecycle.registerReconciler (pkg/dynamiclifecycle/reconciler.go:120)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=35.509µs function="dynamiclifecycle.registerReconciler (pkg/dynamiclifecycle/reconciler.go:120)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=477.289µs function="driftchecker.Register (pkg/driftchecker/checker.go:45)"
time=2025-01-24T00:43:23Z level=debug msg=Invoking function="features.updateAgentConfigMetricOnStart (.../metrics/features/features.go:14)"
time=2025-01-24T00:43:23Z level=debug msg=Invoked duration=5
    cilium-operator    cilium-operator                    1 pods of Deployment cilium-operator are not ready
    cilium-operator    cilium-operator-658fbf65c-bs78c    container cilium-operator is in CrashLoopBackOff, pulling previous Pod logs for further investigation:
2025-01-24T00:44:22.646874302Z Error: unknown flag: --bad-arg-yo
2025-01-24T00:44:22.647348762Z Usage:
2025-01-24T00:44:22.647367160Z   cilium-operator-generic [flags]
2025-01-24T00:44:22.647372538Z   cilium-operator-generic [command]
2025-01-24T00:44:22.647375700Z 
<...>
2025-01-24T00:44:22.647792729Z       --validate-network-policy                              Whether to enable or disable the informational network policy validator (default true)
2025-01-24T00:44:22.647795371Z       --version                                              Print version information
2025-01-24T00:44:22.647797851Z 
2025-01-24T00:44:22.647800608Z Use "cilium-operator-generic [command] --help" for more information about a command.
2025-01-24T00:44:22.647803181Z
    hubble-relay    hubble-relay                     1 pods of Deployment hubble-relay are not ready
    hubble-relay    hubble-relay-6cbf9894c4-lj5kc    container hubble-relay is in CrashLoopBackOff, pulling previous Pod logs for further investigation:
2025-01-24T00:45:25.063779760Z Error: failed to start gops agent: mkdir /.config: permission denied

Error: Unable to determine status:  wait canceled, cilium agent container has crashed or was terminated: context canceled

@joestringer added the release-note/minor label on Jan 22, 2025
@github-actions bot added the cilium-cli label on Jan 22, 2025
@joestringer changed the title from "cli: Improve UX when encountering errors in Cilium components" to "cli: Improve fetching of Cilium component logs in failure scenarios" on Jan 22, 2025
@joestringer force-pushed the pr/joe/cli-print-whats-wrong-maybe branch 4 times, most recently from 570f0d5 to da0c53a on January 24, 2025 01:15
@joestringer (Member, Author)

/test

@joestringer marked this pull request as ready for review January 24, 2025 02:15
@joestringer requested review from a team as code owners January 24, 2025 02:15
@joestringer force-pushed the pr/joe/cli-print-whats-wrong-maybe branch from da0c53a to aa0a5e0 on January 28, 2025 01:50
@joestringer requested a review from a team as a code owner January 28, 2025 01:50
@joestringer requested a review from brlbil January 28, 2025 01:50
@joestringer (Member, Author)

/test

@joestringer enabled auto-merge January 28, 2025 01:51
@joestringer (Member, Author)

@derailed gentle ping for review 🙏

@derailed (Contributor) left a comment

@joestringer Nice work Joe. That would help a lot!

@maintainer-s-little-helper bot added the ready-to-merge label on Jan 30, 2025
@joestringer force-pushed the pr/joe/cli-print-whats-wrong-maybe branch from aa0a5e0 to bcf4502 on January 30, 2025 16:56
@joestringer (Member, Author)

/test

Previously, cilium status would pull the proper Pod logs only if the
k8s container status didn't already include logs. In my own local
testing with k8s v1.29.x I found that the container status was
regularly unreliable, in that it would fail to include the last few
hundred bytes of logs from the agent in cases where the agent had
already logged a large amount (e.g. 50KiB+). This would hide critical
information from the end of the log.

Since we already limit ourselves to making only one Cilium Pod logs
request per CLI status invocation, let's just always fetch the proper
Pod logs first and only fall back to the container status output when
investigating the second and subsequent failing Pods.

Signed-off-by: Joe Stringer <joe@cilium.io>
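
A minimal client-go sketch of that approach, under an assumed helper
name (illustrative, not the CLI's actual code). The key detail is
`Previous: true`, which requests the logs of the terminated container
instance rather than relying on the possibly-truncated tail embedded
in the container status:

```go
package status

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// previousPodLogs is a hypothetical helper that fetches the full logs
// of the previously terminated container instance directly, instead of
// trusting the (possibly truncated) container-status message.
func previousPodLogs(ctx context.Context, client kubernetes.Interface,
	namespace, pod, container string) (string, error) {
	opts := &corev1.PodLogOptions{
		Container: container,
		Previous:  true, // logs from the last terminated instance
	}
	stream, err := client.CoreV1().Pods(namespace).GetLogs(pod, opts).Stream(ctx)
	if err != nil {
		return "", err
	}
	defer stream.Close()
	b, err := io.ReadAll(stream)
	return string(b), err
}
```
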
This change moves the logic to pull and process cilium-agent logs into
a dedicated function so it can be reused for cilium-operator and
hubble-relay logs. Additionally, rather than overriding the logs
gathered via container status, this splits the gathering into a
dedicated step. This provides more consistency when multiple pods fail
for the same reason.

Signed-off-by: Joe Stringer <joe@cilium.io>

Previously, the Cilium CLI would print the first 50 messages logged up
to 2 minutes before the last termination of the Cilium agent, in order
to provide some context about why a failure may have occurred.

This commit takes a slightly different strategy: treat the 5 oldest
lines, the 5 newest lines and any errors/warnings within that 2-minute
window as the most important, and make sure to print those. Omit all
other logs to minimize noise for the user (but print a hint that some
logs are missing, "<...>").

Signed-off-by: Joe Stringer <joe@cilium.io>

When the cilium-operator fails, for example with CrashLoopBackOff, pull
the previous logs for that component to share with the user.

Signed-off-by: Joe Stringer <joe@cilium.io>

When the hubble-relay fails, for example with CrashLoopBackOff, pull
the previous logs for that component to share with the user.

Signed-off-by: Joe Stringer <joe@cilium.io>

When the clustermesh-apiserver fails, for example with
CrashLoopBackOff, pull the previous logs for that component to share
with the user.

Signed-off-by: Joe Stringer <joe@cilium.io>
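
The shared detection step can be pictured with a short sketch (again
under assumed names, not the actual implementation): inspect each
Pod's container statuses for a CrashLoopBackOff waiting reason, then
pull the previous logs for every crashed container.

```go
package status

import corev1 "k8s.io/api/core/v1"

// crashedContainers is an illustrative helper that returns the names
// of containers in a Pod currently waiting in CrashLoopBackOff;
// callers would then fetch the previous logs for each of them.
func crashedContainers(pod *corev1.Pod) []string {
	var crashed []string
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			crashed = append(crashed, cs.Name)
		}
	}
	return crashed
}
```

Applied uniformly, the same check covers cilium-operator, hubble-relay
and clustermesh-apiserver pods alike.
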
This function fetches container logs for a target pod; make the code
clearer by renaming the function.

Signed-off-by: Joe Stringer <joe@cilium.io>

@joestringer force-pushed the pr/joe/cli-print-whats-wrong-maybe branch from bcf4502 to 9f2f9da on January 30, 2025 17:23
@joestringer (Member, Author)

/test

@joestringer added this pull request to the merge queue Jan 31, 2025
Merged via the queue into main with commit 448fe4f Jan 31, 2025
218 checks passed
@joestringer deleted the pr/joe/cli-print-whats-wrong-maybe branch January 31, 2025 10:41
Labels
cilium-cli: This PR contains changes related with cilium-cli
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/minor: This PR changes functionality that users may find relevant to operating Cilium.