-
Notifications
You must be signed in to change notification settings - Fork 3.4k
cli: Improve fetching of Cilium component logs in failure scenarios #37160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
570f0d5
to
da0c53a
Compare
/test |
marseel
reviewed
Jan 24, 2025
marseel
approved these changes
Jan 24, 2025
joamaki
approved these changes
Jan 27, 2025
da0c53a
to
aa0a5e0
Compare
/test |
brlbil
approved these changes
Jan 28, 2025
@derailed gentle ping for review 🙏 |
derailed
reviewed
Jan 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joestringer Nice work Joe. That would help a lot!
aa0a5e0
to
bcf4502
Compare
/test |
Previously the cilium status would pull logs from the k8s container status only if the container status didn't have logs. In my own local testing with k8s v1.29.x I found that the container status was regularly unreliable, in that it would fail to include the last few hundred bytes of logs from the agent in cases where the agent already logged a large amount of logs (eg 50KiB+). This would hide away critical information from the end of the log. Since we already have the limit of making only one cilium pod logs request per CLI status invocation, let's just always fetch the proper Pod logs first and only fall back to the container status output for investigation of second and subsequent failing Pods. Signed-off-by: Joe Stringer <joe@cilium.io>
This change moves the logic to pull and process cilium-agent logs into a dedicated function for reuse for cilium-operator and hubble-relay logs. Additionally, rather than overriding the logs gathered via container status, this splits the gathering into a dedicated step. This provides more consistency when multiple pods fail for the same reason. Signed-off-by: Joe Stringer <joe@cilium.io>
Previously, the Cilium CLI would print the first 50 messages printed up to 2 minutes before the last termination of the Cilium agent, in order to provide some context about why a failure may have occurred. This commit comes up with a slightly different strategy: Let's say that the 5 oldest lines, 5 newest lines and any errors/warnings within that 2 minute window are the most important. Make sure to print those. Omit all other logs to minimize noise to the user (but print a hint that some logs are missing, "<...>"). Signed-off-by: Joe Stringer <joe@cilium.io>
When the cilium-operator fails, such as CrashLoopBackoff, pull the previous logs for that component to share with the user. Signed-off-by: Joe Stringer <joe@cilium.io>
When the hubble-relay fails, such as CrashLoopBackoff, pull the previous logs for that component to share with the user. Signed-off-by: Joe Stringer <joe@cilium.io>
When the clustermesh-apiserver fails, such as CrashLoopBackoff, pull the previous logs for that component to share with the user. Signed-off-by: Joe Stringer <joe@cilium.io>
This function fetches container logs for a target pod, make the code clearer by renaming the function. Signed-off-by: Joe Stringer <joe@cilium.io>
bcf4502
to
9f2f9da
Compare
derailed
approved these changes
Jan 30, 2025
/test |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
cilium-cli
This PR contains changes related with cilium-cli
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/minor
This PR changes functionality that users may find relevant to operating Cilium.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Previously, the Cilium CLI would print the first 50 messages printed
up to 2 minutes before the last termination of the Cilium agent, in
order to provide some context about why a failure may have occurred.
This PR comes up with a slightly different strategy: Let's say that
the 5 oldest lines, 5 newest lines and any errors/warnings within that
2 minute window are the most important. Make sure to print those. Omit
all other logs to minimize noise to the user (but print a hint that some
logs are missing,
<...>
).Furthermore this extends the log gathering capabilities to not only
gather logs from the cilium-agent when it encounters errors, but also
from other components like cilium-operator, hubble-relay and
clustermesh-apiserver.
Some sample output when I simulated a bunch of errors in the cilium-agent,
operator and hubble-relay components: