-
Notifications
You must be signed in to change notification settings - Fork 3.4k
EndpointManager: fix deadlock when releasing an endpoint #21771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find! It seems that dae07b5 introduced this so adding Fixes: ...
to the commit would be good.
162d562
to
e023308
Compare
In high-churn clusters, there can be a three-party deadlock between the EndpointManager, the PolicyRepository, and a given Endpoint. One of the "links in the chain" is merely trying to get the container ID and namespace+name of an Endpoint for logging. Which we already have. So, rather than trying to lock an Endpoint to get it's identifiers again, just use the copy we already have. Fixes: dae07b5 (endpointmanager: Remove goroutine for ID release) Signed-off-by: Casey Callendrello <cdc@isovalent.com>
e023308
to
1e37eb3
Compare
/test |
/test |
/test |
Fairly certain the test failures are a flake here. Test-runtime failed with
and k8s-1.16-kernel-4.9 fails two of the connectivity tests. Given that this just changes an error log message, they're not related. |
Yep, that's #21735. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brilliant :-)
Nice work! In future can you put the relevant stack traces into the commit message? This can help others to review and learn about this fix, and also makes it easier to search for these stack traces in future in case they hit this issue independently. Then we can compare another failure vs. this fix and confirm whether this fixes the deadlock or not. |
In high-churn clusters, there can be a three-party deadlock between the EndpointManager, the PolicyRepository, and a given Endpoint. One of the "links in the chain" is merely trying to get the container ID and namespace+name of an Endpoint for logging. Which we already have.
So, rather than trying to lock an Endpoint to get it's identifiers again, just use the copy we already have.
Fixes: dae07b5 (endpointmanager: Remove goroutine for ID release)
Signed-off-by: Casey Callendrello cdc@isovalent.com