-
Notifications
You must be signed in to change notification settings - Fork 3.4k
test: Fix kubectl log retrieval for badLogMessages #10717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
2ea5384
to
0aff768
Compare
2c89376
to
854d006
Compare
854d006
to
126a51e
Compare
74aa57d
to
a15ec1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mostly ok with this going in but I'm worried about how it will evolve over time. I can be convinced either way, so this isn't a hard "request changes"
test/helpers/kubectl.go
Outdated
@@ -2337,12 +2337,18 @@ func (kub *Kubectl) CiliumCheckReport(ctx context.Context) { | |||
// `deadlocks` or `segmentation faults` messages. In case of any of these | |||
// messages, it'll mark the test as failed. | |||
func (kub *Kubectl) ValidateNoErrorsInLogs(duration time.Duration) { | |||
kub.ValidateNoErrorsInLogsWhitelist(duration, map[string]bool{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to pass in the whitelist as a nil. It's not a big deal. We have every combination of style in the code already.
test/helpers/utils.go
Outdated
@@ -430,9 +430,12 @@ func CanRunK8sVersion(ciliumVersion, k8sVersionStr string) (bool, error) { | |||
// failIfContainsBadLogMsg makes a test case to fail if any message from | |||
// given log messages contains an entry from badLogMessages (map key) AND | |||
// does not contain ignore messages (map value). | |||
func failIfContainsBadLogMsg(logs string) { | |||
func failIfContainsBadLogMsg(logs string, whitelist map[string]bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this replicate what the values in the badLogMessages does already? I would prefer to pass in the badLogMessages
instead of added a slightly different whitelist mechanism. On the other hand, I'm not sure if that system is as flexible or simple as you need it since all you wanted here was to suppress blacklisting one message :/ (this is where I'm on the fence. Your new whitelist is a lot more clear)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure which way to go either. If we pass badLogMessages
, we would likely still want two ValidateNoErrorsInLogs{,Default}
functions to avoid changing all call sites, but we wouldn't need to check the whitelist at each loop iteration.
However, if we pass badLogMessages
, we'll need to change the default map for K8sUpdates
. If we create a new map specifically for K8sUpdates
, we risk missing future additions to badLogMessages
. From my (still limited) understanding of Go, we would need to make a deep copy of badLogMessages
to remove one entry (correct?), which sounds more involved than using a whitelist mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While whitelist may seem clear, I think we should not change how this func works, and explicitly pass badLogMessages as an argument, as Ray suggested.
I think we should add a function which will copy global badLogMessages
and return the copy, this way we make sure that we always work with fresh copy of the map, which can be manipulated per test suite.
As of this change, this function effectively gets two maps and calculates a difference between them, I think it should be on the caller to define "bad" log messages, and adding a whitelist adds unnecessary complexity into the function and makes it less readable.
We shouldn't have used a global variable in the first place, but adding whitelist seems like a hacky workaround instead of making the code better structured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, that is exactly what went through my mind too. You're also correct about the map deep-copy (unless they made copy
do something in recent versions).
I guess my thinking came down to two questions (I'm not as worried as you about changing all the callsites despite it making the PR look much bigger):
1- what will people copy-paste?
2- will we ever debug a test because we think the whitelist is only the parameter, but then the built-in one is also taking effect.
In any case, your point about missing updates to the original badLogMessages wins out, I think. I'll approve :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a function which will copy global
badLogMessages
and return the copy, this way we make sure that we always work with fresh copy of the map, which can be manipulated per test suite.
Makes sense. I'll push an updated branch in a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the branch with:
- A
GetBadLogMessages()
function to get a copy of thebadLogMessages
. - A
ValidateListOfErrorsInLogs
function to pass a custom list of messages.ValidateNoErrorsInLogs
callsValidateListOfErrorsInLogs
with the defaultbadLogMessages
.
a15ec1a
to
432ceab
Compare
test/helpers/utils.go
Outdated
@@ -430,9 +430,12 @@ func CanRunK8sVersion(ciliumVersion, k8sVersionStr string) (bool, error) { | |||
// failIfContainsBadLogMsg makes a test case to fail if any message from | |||
// given log messages contains an entry from badLogMessages (map key) AND | |||
// does not contain ignore messages (map value). | |||
func failIfContainsBadLogMsg(logs string) { | |||
func failIfContainsBadLogMsg(logs string, whitelist map[string]bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While whitelist may seem clear, I think we should not change how this func works, and explicitly pass badLogMessages as an argument, as Ray suggested.
I think we should add a function which will copy global badLogMessages
and return the copy, this way we make sure that we always work with fresh copy of the map, which can be manipulated per test suite.
As of this change, this function effectively gets two maps and calculates a difference between them, I think it should be on the caller to define "bad" log messages, and adding a whitelist adds unnecessary complexity into the function and makes it less readable.
We shouldn't have used a global variable in the first place, but adding whitelist seems like a hacky workaround instead of making the code better structured.
omfg github, why can't you update a page!?!?!! |
432ceab
to
a2e4c68
Compare
exceptionsCopy := make([]string, len(exceptions)) | ||
copy(exceptionsCopy, exceptions) | ||
mapCopy[badMsg] = exceptionsCopy | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth for someone more familiar with Go to double check this deep copy ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
exceptionsCopy := make([]string, len(exceptions)) | ||
copy(exceptionsCopy, exceptions) | ||
mapCopy[badMsg] = exceptionsCopy | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
a2e4c68
to
49b2d09
Compare
CI was hitting #10760. test-me-please |
CI was hitting #10774. test-focus K8sUpdates.* |
test-me-please (known flake, all tests passed otherwise https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/18353/) |
test-me-please |
After each test case, we retrieve Cilium logs to look for known-bad log messages. Daniel recently noticed that tests containing known-bad message logs are passing. The issue is in the kubectl logs command we use to retrieve the logs. Its behavior seems to have changed at some point. When used with a label selector---as we do---the option --tail has a default value of 10 instead of -1. We are therefore only seeing the 10 last lines of each Cilium pod, regardless of the --since option's value. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Paul Chaignon <paul@cilium.io>
This commit defines a new function ValidateNoErrorsInLogsWhitelist() to enable whitelisting of some (ordinarily) known-bad messages when validating the logs. We then use this new function to whitelist 'Removing map to allow for property upgrade' messages during up/downgrade tests, during which such messages are expected. Signed-off-by: Paul Chaignon <paul@cilium.io>
49b2d09
to
93570d1
Compare
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice finding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@pchaigno What's the general rationale for backporting this PR and have you looked at the potential required PRs to make it test successfully? Also worth considering whether it makes sense for only backporting to the latest (v1.7) release and not the earlier releases. I noticed the following potentially related PRs from backport perspective, #10680 ; #10699 ; #10728. The documented rationale to backport PRs to the latest release is "all bugfixes", then older releases have more strict guidelines: I'm happy to be convinced that we should take this change (we have taken CI improvements back to the latest release in the past), but right now this seems like a potentially large amount of backporting work so I don't want to go ahead and do it without considering whether it's worthwhile. |
Looking again it seems like the first commit is a no-brainer while the second commit looks dubious for backport. We typically backport on a per-PR basis rather than a per-commit basis but perhaps we should make an exception for this PR? Anyway, I am ignoring this PR for now from backporting perspective. |
Ah, I was missing these guidelines. So, yes, only first commit should be backported to only 1.7 (second commit is not needed since |
After each test case, we retrieve Cilium logs to look for known-bad log messages. Daniel recently noticed that tests containing known-bad message logs are passing.
The issue is in the kubectl logs command we use to retrieve the logs. Its behavior seems to have changed at some point. When used with a label selector---as we do---the option
--tail
has a default value of 10 instead of -1. We are therefore only seeing the 10 last lines of each Cilium pod, regardless of the--since
option's value.The second commit whitelists map removal log messages for up/downgrade tests, for which they are expected.
/cc @borkmann