Skip to content

Conversation

pchaigno
Copy link
Member

@pchaigno pchaigno commented Jul 9, 2020

The RuntimeKVStoreTest tests (both Consul and Etcd) have been failing for a while because our ValidateNoErrorsInLogs check sometimes finds errors in the logs. This pull request fixes it.

The first commit fixes the Consul test for which no Cilium logs were registered.

Wait... what?! Then how is ValidateNoErrorsInLogs failing if we don't even print the logs??

Second commit's message:

As for most tests, at the end of RuntimeKVStoreTest, we validate the logs don't contain any worrisome messages with:

vm.ValidateNoErrorsOnLogs(CurrentGinkgoTestDescription().Duration)

CurrentGinkgoTestDescription().Duration includes the execution of all ginkgo.BeforeEach. In RuntimeKVStoreTest, one of our ginkgo.BeforeEach calls stops the Cilium systemd service (because we run cilium-agent as a standalone binary in the test itself). Stopping Cilium can result in worrisome messages in the logs e.g., if the compilation of BPF programs is terminated abruptly. This in turn makes the tests fail once in a while.

To fix this, we can replace CurrentGinkgoTestDescription().Duration with our own "time counter" that doesn't include any of the ginkgo.BeforeEach executions.

To validate this fix, I ran the whole RuntimeKVStoreTest with this change 60 times locally and 60 times in the CI (#12419). The tests passed all 120 times. Before applying the fix, the Consul test would fail ~1/30 times, both locally and in CI.

I am not marking this for backport just yet. I'd like to see it run a couple time successfully in master before. It wouldn't be the first time we erroneously think we fixed a flake.

Fixes: #11895
Fixes: #3211
Fixes: #1733

pchaigno added 2 commits July 9, 2020 19:37
For KV store tests, we execute cilium-agent as a standalone binary, in
which case logs are printed to stdout. For the Consul KV store, however,
we never append stdout to system logs, so logs for that test are not
preserved.

The Etcd test was fixed in 8864c72, but the Consul test was apparently
overlooked.

Fixes: f0741a6 ("Tests: Ginkgo test framework")
Signed-off-by: Paul Chaignon <paul@cilium.io>
As for most tests, at the end of RuntimeKVStoreTest, we validate the logs
don't contain any worrisome messages with:

    vm.ValidateNoErrorsOnLogs(CurrentGinkgoTestDescription().Duration)

CurrentGinkgoTestDescription().Duration includes the execution of all
ginkgo.BeforeEach [1]. In RuntimeKVStoreTest, one of our
ginkgo.BeforeEach stops the Cilium systemd service (because we run
cilium-agent as a standalone binary in the test itself). Stopping Cilium
can result in worrisome messages in the logs e.g., if the compilation of
BPF programs is terminated abruptly. This in turn makes the tests fail
once in a while.

To fix this, we can replace CurrentGinkgoTestDescription().Duration with
our own "time counter" that doesn't include any of the
ginkgo.BeforeEach executions.

To validate this fix, I ran the whole RuntimeKVStoreTest with this
change 60 times locally and 60 times in the CI (#12419). The tests
passed all 120 times. Before applying the fix, the Consul test would
fail ~1/30 times, both locally and in CI.

1 - https://github.com/onsi/ginkgo/blob/9c254cb251dc962dc20ca91d0279c870095cfcf9/internal/spec/spec.go#L132-L134
Fixes: #11895
Fixes: 5185789 ("Test: Checks for deadlocks panics in logs per each test.")
Related: #12419
Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Jul 9, 2020
@pchaigno pchaigno requested a review from a team as a code owner July 9, 2020 18:26
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 9, 2020
@pchaigno pchaigno added the release-note/ci This PR makes changes to the CI. label Jul 9, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 9, 2020
@pchaigno pchaigno requested a review from nebril July 9, 2020 18:27
@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 36.99% when pulling 8da4721 on pr/pchaigno/fix-runtimekvstore-tests into 93e4845 on master.

@pchaigno
Copy link
Member Author

pchaigno commented Jul 9, 2020

retest-runtime

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done, hell of a find!

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 🕵🏻‍♂️

Copy link
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noice!

@pchaigno
Copy link
Member Author

Runtime tests are passing and this pull request only modifies RuntimeKVStoreTest. In addition, #12419 ran repeated tests of this (60 times, by running retest-runtime twice with a kvstore.go file replicated 30 times). Thus, I'm marking as ready to merge.

I have set a reminder to backport this in a week, once we have seen it green in master for a long enough time.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 10, 2020
@nebril nebril merged commit e558100 into master Jul 10, 2020
@nebril nebril deleted the pr/pchaigno/fix-runtimekvstore-tests branch July 10, 2020 09:55
@pchaigno
Copy link
Member Author

The KVStore tests did not fail again during the past week, so I think we can now backport.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: RuntimeKVStoreTest: Stopping cilium during Runtime tests
5 participants