test: Fix kubectl log retrieval for badLogMessages #10717

pchaigno · 2020-03-26T13:04:18Z

After each test case, we retrieve Cilium logs to look for known-bad log messages. Daniel recently noticed that tests containing known-bad message logs are passing.

The issue is in the kubectl logs command we use to retrieve the logs. Its behavior seems to have changed at some point. When used with a label selector---as we do---the option --tail has a default value of 10 instead of -1. We are therefore only seeing the 10 last lines of each Cilium pod, regardless of the --since option's value.

The second commit whitelists map removal log messages for up/downgrade tests, for which they are expected.

/cc @borkmann

coveralls · 2020-03-26T13:32:15Z

Coverage increased (+0.02%) to 45.528% when pulling 93570d1 on pr/pchaigno/fix-bad-log-messages into 60bd7f5 on master.

pchaigno · 2020-03-27T13:47:12Z

The tests are currently failing because of the padding issue introduced in #10626 and fixed in #10728. I'll rebase and restart the tests once that is merged. We'll also need a way to whitelist bad log messages, at least for down/upgrade tests (for which some map removal is expected).

raybejjani

I'm mostly ok with this going in but I'm worried about how it will evolve over time. I can be convinced either way, so this isn't a hard "request changes"

raybejjani · 2020-03-30T09:33:52Z

test/helpers/kubectl.go

@@ -2337,12 +2337,18 @@ func (kub *Kubectl) CiliumCheckReport(ctx context.Context) {
 // `deadlocks` or `segmentation faults` messages. In case of any of these
 // messages, it'll mark the test as failed.
 func (kub *Kubectl) ValidateNoErrorsInLogs(duration time.Duration) {
+	kub.ValidateNoErrorsInLogsWhitelist(duration, map[string]bool{})


You should be able to pass in the whitelist as a nil. It's not a big deal. We have every combination of style in the code already.

raybejjani · 2020-03-30T09:41:32Z

test/helpers/utils.go

@@ -430,9 +430,12 @@ func CanRunK8sVersion(ciliumVersion, k8sVersionStr string) (bool, error) {
 // failIfContainsBadLogMsg makes a test case to fail if any message from
 // given log messages contains an entry from badLogMessages (map key) AND
 // does not contain ignore messages (map value).
-func failIfContainsBadLogMsg(logs string) {
+func failIfContainsBadLogMsg(logs string, whitelist map[string]bool) {


Doesn't this replicate what the values in the badLogMessages does already? I would prefer to pass in the badLogMessages instead of added a slightly different whitelist mechanism. On the other hand, I'm not sure if that system is as flexible or simple as you need it since all you wanted here was to suppress blacklisting one message :/ (this is where I'm on the fence. Your new whitelist is a lot more clear)

I wasn't sure which way to go either. If we pass badLogMessages, we would likely still want two ValidateNoErrorsInLogs{,Default} functions to avoid changing all call sites, but we wouldn't need to check the whitelist at each loop iteration.

However, if we pass badLogMessages, we'll need to change the default map for K8sUpdates. If we create a new map specifically for K8sUpdates, we risk missing future additions to badLogMessages. From my (still limited) understanding of Go, we would need to make a deep copy of badLogMessages to remove one entry (correct?), which sounds more involved than using a whitelist mechanism.

While whitelist may seem clear, I think we should not change how this func works, and explicitly pass badLogMessages as an argument, as Ray suggested.

I think we should add a function which will copy global badLogMessages and return the copy, this way we make sure that we always work with fresh copy of the map, which can be manipulated per test suite.

As of this change, this function effectively gets two maps and calculates a difference between them, I think it should be on the caller to define "bad" log messages, and adding a whitelist adds unnecessary complexity into the function and makes it less readable.

We shouldn't have used a global variable in the first place, but adding whitelist seems like a hacky workaround instead of making the code better structured.

yup, that is exactly what went through my mind too. You're also correct about the map deep-copy (unless they made copy do something in recent versions).
I guess my thinking came down to two questions (I'm not as worried as you about changing all the callsites despite it making the PR look much bigger):
1- what will people copy-paste?
2- will we ever debug a test because we think the whitelist is only the parameter, but then the built-in one is also taking effect.

In any case, your point about missing updates to the original badLogMessages wins out, I think. I'll approve :)

I think we should add a function which will copy global badLogMessages and return the copy, this way we make sure that we always work with fresh copy of the map, which can be manipulated per test suite.

Makes sense. I'll push an updated branch in a bit.

I've updated the branch with:

A GetBadLogMessages() function to get a copy of the badLogMessages.

A ValidateListOfErrorsInLogs function to pass a custom list of messages. ValidateNoErrorsInLogs calls ValidateListOfErrorsInLogs with the default badLogMessages.

nebril · 2020-03-30T13:01:35Z

test/helpers/utils.go

@@ -430,9 +430,12 @@ func CanRunK8sVersion(ciliumVersion, k8sVersionStr string) (bool, error) {
 // failIfContainsBadLogMsg makes a test case to fail if any message from
 // given log messages contains an entry from badLogMessages (map key) AND
 // does not contain ignore messages (map value).
-func failIfContainsBadLogMsg(logs string) {
+func failIfContainsBadLogMsg(logs string, whitelist map[string]bool) {


While whitelist may seem clear, I think we should not change how this func works, and explicitly pass badLogMessages as an argument, as Ray suggested.

I think we should add a function which will copy global badLogMessages and return the copy, this way we make sure that we always work with fresh copy of the map, which can be manipulated per test suite.

As of this change, this function effectively gets two maps and calculates a difference between them, I think it should be on the caller to define "bad" log messages, and adding a whitelist adds unnecessary complexity into the function and makes it less readable.

We shouldn't have used a global variable in the first place, but adding whitelist seems like a hacky workaround instead of making the code better structured.

raybejjani · 2020-03-30T13:51:08Z

omfg github, why can't you update a page!?!?!!
I missed nebril's comment so, um, now I guess we can play musical chairs with opinions...

pchaigno · 2020-03-30T15:52:37Z

test/helpers/cons.go

+		exceptionsCopy := make([]string, len(exceptions))
+		copy(exceptionsCopy, exceptions)
+		mapCopy[badMsg] = exceptionsCopy
+	}


Probably worth for someone more familiar with Go to double check this deep copy ;-)

raybejjani · 2020-03-30T15:55:34Z

test/helpers/cons.go

+		exceptionsCopy := make([]string, len(exceptions))
+		copy(exceptionsCopy, exceptions)
+		mapCopy[badMsg] = exceptionsCopy
+	}


pchaigno · 2020-03-31T05:40:11Z

CI was hitting #10760.

test-me-please

pchaigno · 2020-03-31T07:42:53Z

CI was hitting #10774.

test-focus K8sUpdates.*

pchaigno · 2020-03-31T09:07:28Z

test-me-please (known flake, all tests passed otherwise https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/18353/)

raybejjani · 2020-03-31T11:23:37Z

test-me-please

After each test case, we retrieve Cilium logs to look for known-bad log messages. Daniel recently noticed that tests containing known-bad message logs are passing. The issue is in the kubectl logs command we use to retrieve the logs. Its behavior seems to have changed at some point. When used with a label selector---as we do---the option --tail has a default value of 10 instead of -1. We are therefore only seeing the 10 last lines of each Cilium pod, regardless of the --since option's value. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Paul Chaignon <paul@cilium.io>

This commit defines a new function ValidateNoErrorsInLogsWhitelist() to enable whitelisting of some (ordinarily) known-bad messages when validating the logs. We then use this new function to whitelist 'Removing map to allow for property upgrade' messages during up/downgrade tests, during which such messages are expected. Signed-off-by: Paul Chaignon <paul@cilium.io>

pchaigno · 2020-03-31T12:59:04Z

test-me-please

aanm

nice finding

nebril

Thanks!

joestringer · 2020-04-01T22:25:40Z

@pchaigno What's the general rationale for backporting this PR and have you looked at the potential required PRs to make it test successfully? Also worth considering whether it makes sense for only backporting to the latest (v1.7) release and not the earlier releases.

I noticed the following potentially related PRs from backport perspective, #10680 ; #10699 ; #10728.

The documented rationale to backport PRs to the latest release is "all bugfixes", then older releases have more strict guidelines:
http://docs.cilium.io/en/latest/contributing/release/backports/#backport-criteria-for-current-minor-release

I'm happy to be convinced that we should take this change (we have taken CI improvements back to the latest release in the past), but right now this seems like a potentially large amount of backporting work so I don't want to go ahead and do it without considering whether it's worthwhile.

joestringer · 2020-04-01T22:28:09Z

Looking again it seems like the first commit is a no-brainer while the second commit looks dubious for backport. We typically backport on a per-PR basis rather than a per-commit basis but perhaps we should make an exception for this PR?

Anyway, I am ignoring this PR for now from backporting perspective.

pchaigno · 2020-04-02T06:03:38Z

The documented rationale to backport PRs to the latest release is "all bugfixes", then older releases have more strict guidelines:
http://docs.cilium.io/en/latest/contributing/release/backports/#backport-criteria-for-current-minor-release

Ah, I was missing these guidelines. So, yes, only first commit should be backported to only 1.7 (second commit is not needed since removingMap wasn't part of bad log messages in v1.7). I'm on backporting duties this week so I was planning to take care of this pull request myself.

This comment has been minimized.

Sign in to view

maintainer-s-little-helper bot added the dont-merge/needs-release-note label Mar 26, 2020

pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from 2ea5384 to 0aff768 Compare March 26, 2020 13:31

pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch 2 times, most recently from 2c89376 to 854d006 Compare March 27, 2020 11:19

pchaigno changed the title ~~Testing fix for badLogMessages~~ test: Fix kubectl log retrieval for badLogMessages Mar 27, 2020

pchaigno added area/CI Continuous Integration testing issue or flake dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. release-note/misc This PR makes changes that have no direct user impact. wip labels Mar 27, 2020

maintainer-s-little-helper bot removed the dont-merge/needs-release-note label Mar 27, 2020

pchaigno mentioned this pull request Mar 27, 2020

daemon: Create all global maps in cilium-agent #10626

Merged

pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from 854d006 to 126a51e Compare March 27, 2020 18:04

pchaigno added the kind/bug/CI This is a bug in the testing code. label Mar 27, 2020

pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch 2 times, most recently from 74aa57d to a15ec1a Compare March 29, 2020 15:34

pchaigno marked this pull request as ready for review March 29, 2020 15:34

pchaigno requested a review from a team as a code owner March 29, 2020 15:34

pchaigno added needs-backport/1.5 and removed dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. wip labels Mar 29, 2020

raybejjani suggested changes Mar 30, 2020

View reviewed changes

pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from a15ec1a to 432ceab Compare March 30, 2020 12:40

nebril requested changes Mar 30, 2020

View reviewed changes

raybejjani approved these changes Mar 30, 2020

View reviewed changes

pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from 432ceab to a2e4c68 Compare March 30, 2020 15:46

pchaigno commented Mar 30, 2020

View reviewed changes

raybejjani approved these changes Mar 30, 2020

View reviewed changes

pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from a2e4c68 to 49b2d09 Compare March 30, 2020 16:02

pchaigno added 2 commits March 31, 2020 14:57

pchaigno force-pushed the pr/pchaigno/fix-bad-log-messages branch from 49b2d09 to 93570d1 Compare March 31, 2020 12:58

aanm approved these changes Apr 1, 2020

View reviewed changes

aanm added the pending-review label Apr 1, 2020

nebril approved these changes Apr 1, 2020

View reviewed changes

aanm merged commit 87c51b2 into master Apr 1, 2020

aanm deleted the pr/pchaigno/fix-bad-log-messages branch April 1, 2020 08:17

joestringer mentioned this pull request Apr 1, 2020

v1.7 backports 2020-04-01 #10818

Merged

pchaigno added backport-pending/1.7 and removed needs-backport/1.5 labels Apr 2, 2020

pchaigno mentioned this pull request Apr 2, 2020

v1.7 backports 2020-04-02 #10833

Merged

joestringer added backport-done/1.7 and removed backport-pending/1.7 labels Apr 7, 2020

test: Fix kubectl log retrieval for badLogMessages #10717

test: Fix kubectl log retrieval for badLogMessages #10717

Uh oh!

Conversation

pchaigno commented Mar 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment has been minimized.

coveralls commented Mar 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pchaigno commented Mar 27, 2020

Uh oh!

raybejjani left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

raybejjani Mar 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

raybejjani commented Mar 30, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pchaigno commented Mar 31, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pchaigno commented Mar 31, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pchaigno commented Mar 31, 2020 • edited by raybejjani Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

raybejjani commented Mar 31, 2020

Uh oh!

pchaigno commented Mar 31, 2020

Uh oh!

aanm left a comment

Choose a reason for hiding this comment

Uh oh!

nebril left a comment

Choose a reason for hiding this comment

Uh oh!

joestringer commented Apr 1, 2020

Uh oh!

joestringer commented Apr 1, 2020

Uh oh!

pchaigno commented Apr 2, 2020

Uh oh!

Uh oh!

pchaigno commented Mar 26, 2020 •

edited

Loading

coveralls commented Mar 26, 2020 •

edited

Loading

raybejjani Mar 30, 2020 •

edited

Loading

pchaigno commented Mar 31, 2020 •

edited

Loading

pchaigno commented Mar 31, 2020 •

edited

Loading

pchaigno commented Mar 31, 2020 •

edited by raybejjani

Loading