endpoint: drop duplicate RunMetadataResolver call #37086
Conversation
Do not explicitly call RunMetadataResolver in case the pod namespace label has not been retrieved yet (e.g., the pod informer has not yet received the creation event for the pod), given that the same operation is always guaranteed to be started immediately afterwards as well. Moreover, this duplicate call can eventually cause the endpoint regeneration state machine to go out of sync, leaving the endpoint stuck in the "waiting-to-regenerate" state forever.

Specifically, the above problem occurs due to the following sequence of events:

1. A new pod gets created and scheduled onto a node, eventually leading to an endpoint creation API request.
2. When the request is processed, the pod informer has not yet received the pod creation event, hence causing the pod retrieval to fail.
3. At this point, the set of identity labels only contains the init one (because we couldn't retrieve the pod labels), triggering the first call to ep.RunMetadataResolver. In turn, this starts a controller in the background (non-blocking, as the second parameter is false), which attempts to retrieve the labels again, sets the identity (falling back to the init one in case of retrieval failure) and triggers the endpoint regeneration. The endpoint first transitions into the waiting-to-regenerate state; eventually, the regeneration starts to be processed, and the endpoint moves into the regenerating state.
4. ep.RunMetadataResolver is called a second time; this step would normally be blocking (the second parameter is true), but if the labels are already correct due to the previous call to the same function, no regeneration is required, and the call eventually returns regenTriggered = false.
5. Given that regenTriggered is false, the regeneration is triggered again immediately afterwards [1], transitioning the endpoint into the waiting-to-regenerate state again. It is important to note that in this case we propagate the endpoint creation context as part of the regeneration event, rather than using the endpoint context. This appears correct, but ends up being problematic in this specific case.
6. In the background, the initial regeneration triggered in 3. completes correctly. The endpoint would then transition into the ready state, but that transition is (correctly) skipped because another regeneration has been scheduled subsequently (in 5.). At the same time, WaitForFirstRegeneration [2] unblocks (given that the regeneration completed), letting the endpoint creation API request terminate. This, in turn, causes the associated context to be canceled.
7. At this point, the endpoint regeneration triggered in 5. still needs to be executed; however, depending on the timing, the handling may be aborted in [3], given that the context has just been canceled. While this would normally indicate that the endpoint is terminating, hence not causing any problems, in this case it leaves the endpoint stuck in the waiting-to-regenerate state, from which it will never recover (as all subsequent regeneration triggers are treated as duplicates).

Dropping the duplicate call to ep.RunMetadataResolver prevents this problem, as it ensures that we don't unnecessarily trigger a second regeneration associated with the API context. The fallback regeneration should actually never be required when operating in Kubernetes mode, because ep.RunMetadataResolver will always cause a new identity to be assigned to the endpoint, in turn triggering a regeneration.
[1]: https://github.com/cilium/cilium/blob/a9bcb0cb3858749f584c48cf5922f060c5f5871e/daemon/cmd/endpoint.go#L569-L582
[2]: https://github.com/isovalent/cilium/blob/a9bcb0cb3858749f584c48cf5922f060c5f5871e/daemon/cmd/endpoint.go#L573-L575
[3]: https://github.com/cilium/cilium/blob/a9bcb0cb3858749f584c48cf5922f060c5f5871e/pkg/endpoint/events.go#L64-L71

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
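To make the sequence above easier to follow, here is a minimal, self-contained Go sketch of the control flow. All names (`endpoint`, `runMetadataResolver`, `regenerate`, `createEndpoint`) and the state strings are illustrative stand-ins chosen for this example, not the actual Cilium code or API; only the shape of the logic is modeled: a duplicate non-blocking resolver call, followed by a blocking one and a fallback regeneration tied to the API request context.

```go
// Simplified, hypothetical model of the race; not the Cilium implementation.
package main

import (
	"context"
	"fmt"
)

type endpoint struct {
	state          string // "ready", "waiting-to-regenerate", ...
	labelsResolved bool
}

// runMetadataResolver stands in for ep.RunMetadataResolver: it resolves the
// pod labels and triggers a regeneration only if the identity changed.
// The blocking flag is ignored in this simplified model.
func (e *endpoint) runMetadataResolver(ctx context.Context, blocking bool) (regenTriggered bool) {
	if e.labelsResolved {
		return false // labels (and identity) already correct: nothing to do
	}
	e.labelsResolved = true
	e.regenerate(ctx)
	return true
}

// regenerate models the regeneration event handler: a canceled context causes
// the event to be dropped, stranding the endpoint in waiting-to-regenerate.
func (e *endpoint) regenerate(ctx context.Context) {
	e.state = "waiting-to-regenerate"
	if ctx.Err() != nil {
		return // event aborted; state never leaves waiting-to-regenerate
	}
	e.state = "ready"
}

// createEndpoint models the API handler; it returns the queued fallback
// regeneration so the caller can run it after the API request has completed.
func createEndpoint(apiCtx context.Context, e *endpoint, podLabelsAvailable, withFix bool) (pending func()) {
	if !podLabelsAvailable && !withFix {
		// The duplicate, non-blocking call dropped by this PR: it resolves
		// the labels and triggers the first regeneration in the background.
		e.runMetadataResolver(apiCtx, false)
	}
	// The blocking call that always follows. With the duplicate call above,
	// the labels are already correct, so no regeneration is triggered here.
	if regenTriggered := e.runMetadataResolver(apiCtx, true); !regenTriggered {
		// Fallback regeneration, tied to the API request context; it is only
		// processed after the request has returned and apiCtx was canceled.
		return func() { e.regenerate(apiCtx) }
	}
	return func() {}
}

func main() {
	run := func(withFix bool) string {
		e := &endpoint{}
		apiCtx, cancel := context.WithCancel(context.Background())
		pending := createEndpoint(apiCtx, e, false /* podLabelsAvailable */, withFix)
		cancel()  // first regeneration done, API request returns, context canceled
		pending() // the queued fallback event is handled afterwards
		return e.state
	}
	fmt.Println("without fix:", run(false)) // waiting-to-regenerate (stuck)
	fmt.Println("with fix:   ", run(true))  // ready
}
```

Running the sketch prints waiting-to-regenerate for the pre-fix flow and ready once the duplicate call is dropped, mirroring the stuck state described in steps 5 to 7 above.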
/test
Thanks for the detailed timeline. It's causing me to think about the case where the local informer is behind, so we fail the UID check. I wonder if this would be similarly affected -- I'm playing through this in my head now.
The main difference in that case is that the retrieval from the store is retried a few times and, eventually, a direct get from the API server is also attempted [1]. Overall it seems to me that in that case this bug would be less likely to show up (although still possible), but the fix would address it as well. A few related rough edges of that logic had also already been addressed in #36392 (although that one has not been backported).

[1]: Lines 470 to 472 in f602603
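For context on the path mentioned here, below is a hedged sketch of the general "retry the informer store, then fall back to a direct API server GET" pattern. The helper name `getPodWithFallback`, the retry count, and the delay are made up for illustration; the lister and clientset calls are standard client-go APIs, but this is not the actual Cilium code referenced in [1].

```go
// Hypothetical sketch of store-first pod retrieval with an API server fallback.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	listerv1 "k8s.io/client-go/listers/core/v1"
)

// getPodWithFallback first polls the local informer store a few times (cheap,
// but possibly behind), and only then issues a direct GET against the API
// server, which is authoritative but more expensive.
func getPodWithFallback(ctx context.Context, lister listerv1.PodLister,
	client kubernetes.Interface, namespace, name string) (*corev1.Pod, error) {

	const attempts = 3
	for i := 0; i < attempts; i++ {
		pod, err := lister.Pods(namespace).Get(name)
		if err == nil {
			return pod, nil
		}
		if !apierrors.IsNotFound(err) {
			return nil, err
		}
		// The informer has likely not observed the pod yet; wait and retry.
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(200 * time.Millisecond):
		}
	}

	// Last resort: ask the API server directly, bypassing the local cache.
	return client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
}

func main() {
	fmt.Println("see getPodWithFallback; wiring a real lister/clientset is omitted")
}
```

Because the API server fallback eventually resolves the pod even when the local cache is behind, the stuck-endpoint window is narrower on that path, which matches the assessment that the bug is less likely (though still possible) there.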
This makes perfect sense, thanks!
👋 Thank you very much for fixing this! Would it be possible to label it for backport?
Yes, I think this may fall under the Major bugfixes relevant to the correct operation of Cilium category, and be eligible for backport to v1.15, especially considering that the change is non-invasive. Marking accordingly.
**What problem does this PR solve?**: This version fixes an upstream bug cilium/cilium#37086

**Which issue(s) this PR fixes**: Fixes #

**How Has This Been Tested?**:

**Special notes for your reviewer**: I branched `release/v0.27.x` from the last release and this branch does not include:
* #1054
* #1055