Skip to content

Conversation

squeed
Copy link
Contributor

@squeed squeed commented Oct 17, 2022

In high-churn clusters, there can be a three-party deadlock between the EndpointManager, the PolicyRepository, and a given Endpoint. One of the "links in the chain" is merely trying to get the container ID and namespace+name of an Endpoint for logging. Which we already have.

So, rather than trying to lock an Endpoint to get it's identifiers again, just use the copy we already have.

Fixes: dae07b5 (endpointmanager: Remove goroutine for ID release)

Signed-off-by: Casey Callendrello cdc@isovalent.com

Fixes a deadlock that can be exposed in high-churn clusters when Pods are deleted rapidly.

@squeed squeed added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Oct 17, 2022
@squeed squeed requested a review from a team as a code owner October 17, 2022 20:57
@squeed squeed requested a review from jrajahalme October 17, 2022 20:57
@christarazi christarazi added area/daemon Impacts operation of the Cilium daemon. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels Oct 17, 2022
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! It seems that dae07b5 introduced this so adding Fixes: ... to the commit would be good.

In high-churn clusters, there can be a three-party deadlock between the
EndpointManager, the PolicyRepository, and a given Endpoint. One of the
"links in the chain" is merely trying to get the container ID and
namespace+name of an Endpoint for logging. Which we already have.

So, rather than trying to lock an Endpoint to get it's identifiers
again, just use the copy we already have.

Fixes: dae07b5 (endpointmanager: Remove goroutine for ID release)
Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@christarazi
Copy link
Member

/test

@squeed
Copy link
Contributor Author

squeed commented Oct 18, 2022

/test
Pretty sure these are flakes; will do some more investigation.

@squeed
Copy link
Contributor Author

squeed commented Oct 18, 2022

/test

@squeed
Copy link
Contributor Author

squeed commented Oct 19, 2022

Fairly certain the test failures are a flake here.

Test-runtime failed with

 [11:43:26] FAIL: loader_test.go:159: LoaderTestSuite.TestCompileAndLoadDefaultEndpoint
	 [11:43:26] 
	 [11:43:26] loader_test.go:160:
	 [11:43:26]     s.testCompileAndLoad(c, &ep)
	 [11:43:26] loader_test.go:154:
	 [11:43:26]     c.Assert(err, IsNil)
	 [11:43:26] ... value *fmt.wrapError = &fmt.wrapError{msg:"Failed to compile bpf_lxc.dbg.o: Command execution failed for [llc -march=bpf -mcpu=v3 -filetype=obj -o /tmp/cilium_3082944025/bpf_lxc.dbg.o]: context deadline exceeded", err:(*fmt.wrapError)(0xc00057e120)} ("Failed to compile bpf_lxc.dbg.o: Command execution failed for [llc -march=bpf -mcpu=v3 -filetype=obj -o /tmp/cilium_3082944025/bpf_lxc.dbg.o]: context deadline exceeded")

and k8s-1.16-kernel-4.9 fails two of the connectivity tests. Given that this just changes an error log message, they're not related.

@julianwiedmann
Copy link
Member

and k8s-1.16-kernel-4.9 fails two of the connectivity tests. Given that this just changes an error log message, they're not related.

Yep, that's #21735.

Copy link
Member

@jrajahalme jrajahalme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant :-)

@squeed squeed added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 19, 2022
@aanm aanm merged commit 061e55f into cilium:master Oct 19, 2022
@joestringer
Copy link
Member

Nice work! In future can you put the relevant stack traces into the commit message? This can help others to review and learn about this fix, and also makes it easier to search for these stack traces in future in case they hit this issue independently. Then we can compare another failure vs. this fix and confirm whether this fixes the deadlock or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants