-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Fix RuntimeKVStoreTest flake #12478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix RuntimeKVStoreTest flake #12478
Conversation
For KV store tests, we execute cilium-agent as a standalone binary, in which case logs are printed to stdout. For the Consul KV store, however, we never append stdout to system logs, so logs for that test are not preserved. The Etcd test was fixed in 8864c72, but the Consul test was apparently overlooked. Fixes: f0741a6 ("Tests: Ginkgo test framework") Signed-off-by: Paul Chaignon <paul@cilium.io>
As for most tests, at the end of RuntimeKVStoreTest, we validate the logs don't contain any worrisome messages with: vm.ValidateNoErrorsOnLogs(CurrentGinkgoTestDescription().Duration) CurrentGinkgoTestDescription().Duration includes the execution of all ginkgo.BeforeEach [1]. In RuntimeKVStoreTest, one of our ginkgo.BeforeEach stops the Cilium systemd service (because we run cilium-agent as a standalone binary in the test itself). Stopping Cilium can result in worrisome messages in the logs e.g., if the compilation of BPF programs is terminated abruptly. This in turn makes the tests fail once in a while. To fix this, we can replace CurrentGinkgoTestDescription().Duration with our own "time counter" that doesn't include any of the ginkgo.BeforeEach executions. To validate this fix, I ran the whole RuntimeKVStoreTest with this change 60 times locally and 60 times in the CI (#12419). The tests passed all 120 times. Before applying the fix, the Consul test would fail ~1/30 times, both locally and in CI. 1 - https://github.com/onsi/ginkgo/blob/9c254cb251dc962dc20ca91d0279c870095cfcf9/internal/spec/spec.go#L132-L134 Fixes: #11895 Fixes: 5185789 ("Test: Checks for deadlocks panics in logs per each test.") Related: #12419 Signed-off-by: Paul Chaignon <paul@cilium.io>
retest-runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well done, hell of a find!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work 🕵🏻♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noice!
Runtime tests are passing and this pull request only modifies I have set a reminder to backport this in a week, once we have seen it green in master for a long enough time. |
The KVStore tests did not fail again during the past week, so I think we can now backport. |
The
RuntimeKVStoreTest
tests (both Consul and Etcd) have been failing for a while because ourValidateNoErrorsInLogs
check sometimes finds errors in the logs. This pull request fixes it.The first commit fixes the Consul test for which no Cilium logs were registered.
Wait... what?! Then how is
ValidateNoErrorsInLogs
failing if we don't even print the logs??Second commit's message:
I am not marking this for backport just yet. I'd like to see it run a couple time successfully in master before. It wouldn't be the first time we erroneously think we fixed a flake.
Fixes: #11895
Fixes: #3211
Fixes: #1733