Skip to content

Conversation

glrf
Copy link
Contributor

@glrf glrf commented Nov 21, 2024

The TestNodeManagerEmitStatus test did not synchronize the initial status check. This caused a fairly rare test flake where the starting node manager set the state before the test read the initial state. This caused a deadlock, because the test waited for the status to be updated by the node manager and the node manager waited for the fake NodeHandler to return.

We fix this deadlock and simplify the synchronization logic by replacing two interleaving goroutines by a single goroutine. This way we make sure that we read the initial empty status before the node manager can update it.

CC @bimmlerd
CC @tommyp1ckles

 Fix flake in node manager `TestNodeManagerEmitStatus` test

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 21, 2024
@glrf glrf marked this pull request as ready for review November 21, 2024 13:51
@glrf glrf requested a review from a team as a code owner November 21, 2024 13:51
@glrf glrf requested a review from thorn3r November 21, 2024 13:51
@bimmlerd bimmlerd added kind/bug/CI This is a bug in the testing code. area/CI Continuous Integration testing issue or flake release-note/ci This PR makes changes to the CI. area/agent Cilium agent related. labels Nov 21, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 21, 2024
@glrf glrf force-pushed the fix-flake-node-manager-emit-status branch 2 times, most recently from 32150e9 to 762cbb2 Compare November 21, 2024 14:13
The `TestNodeManagerEmitStatus` test did not synchronize the initial
status check. This caused a fairly rare test flake where the starting
node manager set the state before the test read the initial state. This
caused a deadlock, because the test waited for the status to be updated
by the node manager and the node manager waited for the fake
NodeHandler to return.

We fix this deadlock and simplify the synchronization logic by
replacing two interleaving goroutines by a single goroutine. This way
we make sure that we read the initial empty status before the node manager
can update it.

Fixes: d404ee7 ("node/manager: test: avoid future race in status")
Fixes: b343558 ("nodemanager: rewrite health status test for healthv2.")

Co-authored-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Fabian Fischer <fabian.fischer@isovalent.com>
@glrf glrf force-pushed the fix-flake-node-manager-emit-status branch from 762cbb2 to ec67f82 Compare November 21, 2024 15:14
Copy link
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏 thanks

@glrf
Copy link
Contributor Author

glrf commented Nov 22, 2024

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 26, 2024
@tommyp1ckles tommyp1ckles added this pull request to the merge queue Nov 26, 2024
Merged via the queue into cilium:main with commit 55f7f72 Nov 26, 2024
64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/agent Cilium agent related. area/CI Continuous Integration testing issue or flake kind/bug/CI This is a bug in the testing code. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants