Skip to content

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented Mar 26, 2020

After each test case, we retrieve Cilium logs to look for known-bad log messages. Daniel recently noticed that tests containing known-bad message logs are passing.

The issue is in the kubectl logs command we use to retrieve the logs. Its behavior seems to have changed at some point. When used with a label selector---as we do---the option --tail has a default value of 10 instead of -1. We are therefore only seeing the 10 last lines of each Cilium pod, regardless of the --since option's value.

The second commit whitelists map removal log messages for up/downgrade tests, for which they are expected.

/cc @borkmann

@maintainer-s-little-helper

This comment has been minimized.

@coveralls
Copy link

coveralls commented Mar 26, 2020

Coverage Status

Coverage increased (+0.02%) to 45.528% when pulling 93570d1 on pr/pchaigno/fix-bad-log-messages into 60bd7f5 on master.

@pchaigno pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch 2 times, most recently from 2c89376 to 854d006 Compare March 27, 2020 11:19
@pchaigno pchaigno changed the title Testing fix for badLogMessages test: Fix kubectl log retrieval for badLogMessages Mar 27, 2020
@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. release-note/misc This PR makes changes that have no direct user impact. wip labels Mar 27, 2020
@pchaigno
Copy link
Member Author

The tests are currently failing because of the padding issue introduced in #10626 and fixed in #10728. I'll rebase and restart the tests once that is merged. We'll also need a way to whitelist bad log messages, at least for down/upgrade tests (for which some map removal is expected).

@pchaigno pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from 854d006 to 126a51e Compare March 27, 2020 18:04
@pchaigno pchaigno added the kind/bug/CI This is a bug in the testing code. label Mar 27, 2020
@pchaigno pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch 2 times, most recently from 74aa57d to a15ec1a Compare March 29, 2020 15:34
@pchaigno pchaigno marked this pull request as ready for review March 29, 2020 15:34
@pchaigno pchaigno requested a review from a team as a code owner March 29, 2020 15:34
@pchaigno pchaigno added needs-backport/1.5 and removed dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. wip labels Mar 29, 2020
Copy link
Contributor

@raybejjani raybejjani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly ok with this going in but I'm worried about how it will evolve over time. I can be convinced either way, so this isn't a hard "request changes"

@@ -2337,12 +2337,18 @@ func (kub *Kubectl) CiliumCheckReport(ctx context.Context) {
// `deadlocks` or `segmentation faults` messages. In case of any of these
// messages, it'll mark the test as failed.
func (kub *Kubectl) ValidateNoErrorsInLogs(duration time.Duration) {
kub.ValidateNoErrorsInLogsWhitelist(duration, map[string]bool{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to pass in the whitelist as a nil. It's not a big deal. We have every combination of style in the code already.

@@ -430,9 +430,12 @@ func CanRunK8sVersion(ciliumVersion, k8sVersionStr string) (bool, error) {
// failIfContainsBadLogMsg makes a test case to fail if any message from
// given log messages contains an entry from badLogMessages (map key) AND
// does not contain ignore messages (map value).
func failIfContainsBadLogMsg(logs string) {
func failIfContainsBadLogMsg(logs string, whitelist map[string]bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this replicate what the values in the badLogMessages does already? I would prefer to pass in the badLogMessages instead of added a slightly different whitelist mechanism. On the other hand, I'm not sure if that system is as flexible or simple as you need it since all you wanted here was to suppress blacklisting one message :/ (this is where I'm on the fence. Your new whitelist is a lot more clear)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure which way to go either. If we pass badLogMessages, we would likely still want two ValidateNoErrorsInLogs{,Default} functions to avoid changing all call sites, but we wouldn't need to check the whitelist at each loop iteration.

However, if we pass badLogMessages, we'll need to change the default map for K8sUpdates. If we create a new map specifically for K8sUpdates, we risk missing future additions to badLogMessages. From my (still limited) understanding of Go, we would need to make a deep copy of badLogMessages to remove one entry (correct?), which sounds more involved than using a whitelist mechanism.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While whitelist may seem clear, I think we should not change how this func works, and explicitly pass badLogMessages as an argument, as Ray suggested.

I think we should add a function which will copy global badLogMessages and return the copy, this way we make sure that we always work with fresh copy of the map, which can be manipulated per test suite.

As of this change, this function effectively gets two maps and calculates a difference between them, I think it should be on the caller to define "bad" log messages, and adding a whitelist adds unnecessary complexity into the function and makes it less readable.

We shouldn't have used a global variable in the first place, but adding whitelist seems like a hacky workaround instead of making the code better structured.

Copy link
Contributor

@raybejjani raybejjani Mar 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, that is exactly what went through my mind too. You're also correct about the map deep-copy (unless they made copy do something in recent versions).
I guess my thinking came down to two questions (I'm not as worried as you about changing all the callsites despite it making the PR look much bigger):
1- what will people copy-paste?
2- will we ever debug a test because we think the whitelist is only the parameter, but then the built-in one is also taking effect.

In any case, your point about missing updates to the original badLogMessages wins out, I think. I'll approve :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a function which will copy global badLogMessages and return the copy, this way we make sure that we always work with fresh copy of the map, which can be manipulated per test suite.

Makes sense. I'll push an updated branch in a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the branch with:

  • A GetBadLogMessages() function to get a copy of the badLogMessages.
  • A ValidateListOfErrorsInLogs function to pass a custom list of messages. ValidateNoErrorsInLogs calls ValidateListOfErrorsInLogs with the default badLogMessages.

@pchaigno pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from a15ec1a to 432ceab Compare March 30, 2020 12:40
@@ -430,9 +430,12 @@ func CanRunK8sVersion(ciliumVersion, k8sVersionStr string) (bool, error) {
// failIfContainsBadLogMsg makes a test case to fail if any message from
// given log messages contains an entry from badLogMessages (map key) AND
// does not contain ignore messages (map value).
func failIfContainsBadLogMsg(logs string) {
func failIfContainsBadLogMsg(logs string, whitelist map[string]bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While whitelist may seem clear, I think we should not change how this func works, and explicitly pass badLogMessages as an argument, as Ray suggested.

I think we should add a function which will copy global badLogMessages and return the copy, this way we make sure that we always work with fresh copy of the map, which can be manipulated per test suite.

As of this change, this function effectively gets two maps and calculates a difference between them, I think it should be on the caller to define "bad" log messages, and adding a whitelist adds unnecessary complexity into the function and makes it less readable.

We shouldn't have used a global variable in the first place, but adding whitelist seems like a hacky workaround instead of making the code better structured.

@raybejjani
Copy link
Contributor

omfg github, why can't you update a page!?!?!!
I missed nebril's comment so, um, now I guess we can play musical chairs with opinions...

@pchaigno pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from 432ceab to a2e4c68 Compare March 30, 2020 15:46
exceptionsCopy := make([]string, len(exceptions))
copy(exceptionsCopy, exceptions)
mapCopy[badMsg] = exceptionsCopy
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth for someone more familiar with Go to double check this deep copy ;-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

exceptionsCopy := make([]string, len(exceptions))
copy(exceptionsCopy, exceptions)
mapCopy[badMsg] = exceptionsCopy
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pchaigno pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from a2e4c68 to 49b2d09 Compare March 30, 2020 16:02
@pchaigno
Copy link
Member Author

pchaigno commented Mar 31, 2020

CI was hitting #10760.

test-me-please

@pchaigno
Copy link
Member Author

pchaigno commented Mar 31, 2020

CI was hitting #10774.

test-focus K8sUpdates.*

@pchaigno
Copy link
Member Author

pchaigno commented Mar 31, 2020

test-me-please (known flake, all tests passed otherwise https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/18353/)

@raybejjani
Copy link
Contributor

test-me-please

After each test case, we retrieve Cilium logs to look for known-bad log
messages. Daniel recently noticed that tests containing known-bad
message logs are passing.

The issue is in the kubectl logs command we use to retrieve the logs.
Its behavior seems to have changed at some point. When used with a label
selector---as we do---the option --tail has a default value of 10
instead of -1. We are therefore only seeing the 10 last lines of each
Cilium pod, regardless of the --since option's value.

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Paul Chaignon <paul@cilium.io>
This commit defines a new function ValidateNoErrorsInLogsWhitelist() to
enable whitelisting of some (ordinarily) known-bad messages when
validating the logs. We then use this new function to whitelist 'Removing
map to allow for property upgrade' messages during up/downgrade tests,
during which such messages are expected.

Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from 49b2d09 to 93570d1 Compare March 31, 2020 12:58
@pchaigno
Copy link
Member Author

test-me-please

Copy link
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice finding

Copy link
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@aanm aanm merged commit 87c51b2 into master Apr 1, 2020
@aanm aanm deleted the pr/pchaigno/fix-bad-log-messages branch April 1, 2020 08:17
@joestringer
Copy link
Member

@pchaigno What's the general rationale for backporting this PR and have you looked at the potential required PRs to make it test successfully? Also worth considering whether it makes sense for only backporting to the latest (v1.7) release and not the earlier releases.

I noticed the following potentially related PRs from backport perspective, #10680 ; #10699 ; #10728.

The documented rationale to backport PRs to the latest release is "all bugfixes", then older releases have more strict guidelines:
http://docs.cilium.io/en/latest/contributing/release/backports/#backport-criteria-for-current-minor-release

I'm happy to be convinced that we should take this change (we have taken CI improvements back to the latest release in the past), but right now this seems like a potentially large amount of backporting work so I don't want to go ahead and do it without considering whether it's worthwhile.

@joestringer
Copy link
Member

Looking again it seems like the first commit is a no-brainer while the second commit looks dubious for backport. We typically backport on a per-PR basis rather than a per-commit basis but perhaps we should make an exception for this PR?

Anyway, I am ignoring this PR for now from backporting perspective.

@pchaigno
Copy link
Member Author

pchaigno commented Apr 2, 2020

The documented rationale to backport PRs to the latest release is "all bugfixes", then older releases have more strict guidelines:
http://docs.cilium.io/en/latest/contributing/release/backports/#backport-criteria-for-current-minor-release

Ah, I was missing these guidelines. So, yes, only first commit should be backported to only 1.7 (second commit is not needed since removingMap wasn't part of bad log messages in v1.7). I'm on backporting duties this week so I was planning to take care of this pull request myself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake kind/bug/CI This is a bug in the testing code. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants