
Conversation

giorio94
Member

@giorio94 giorio94 commented Jan 20, 2025

Do not explicitly call RunMetadataResolver in case the pod namespace label has not been retrieved yet (e.g., the pod informer has not yet received the creation event for the pod), given that the same operation is always guaranteed to be started immediately afterwards as well. Moreover, this duplicate call can eventually cause the endpoint regeneration state machine to go out of sync, leaving the endpoint stuck in "waiting-to-regenerate" state forever.

Specifically, the above problem occurs due to the following sequence of events:

  1. A new pod gets created and scheduled onto a node, in turn eventually leading to an endpoint creation API request.
  2. When the request is processed, the pod informer has not yet received the pod creation event, hence causing the pod retrieval to fail.
  3. At this point, the set of identity labels only contains the init one (because we couldn't retrieve the pod labels), hence triggering the first call to ep.RunMetadataResolver. In turn, this starts a controller in the background (not blocking, as the second parameter is false), which attempts to retrieve the labels again, sets the identity (using the init one in case of retrieval failure) and triggers the endpoint regeneration. The endpoint first transitions into the waiting-to-regenerate state; eventually, the regeneration starts to be processed, and the endpoint moves into the regenerating state.
  4. ep.RunMetadataResolver is called a second time; this step would normally be blocking (the second parameter is true), but if the labels are already correct thanks to the previous call to the same function, no regeneration is required, and the call eventually returns regenTriggered = false.
  5. Given that regenTriggered is false, the regeneration is triggered again immediately afterwards [1], transitioning the endpoint into the waiting-to-regenerate state again. It is important to note that here we propagate the endpoint creation context as part of the regeneration event, rather than using the endpoint context. This appears correct, but turns out to be problematic in this specific case.
  6. In the background, the initial regeneration triggered in 3. completes correctly. An attempt is then made to transition the endpoint into the ready state, but that is (correctly) skipped because another regeneration has been scheduled subsequently (in 5.). At the same time, WaitForFirstRegeneration [2] unblocks (given that the regeneration completed), letting the endpoint creation API request terminate. This, in turn, causes the associated context to be canceled.
  7. At this point, the endpoint regeneration triggered in 5. still needs to be executed; however, depending on the timing, the handling may be aborted in [3], given that the context has just been canceled. While this would normally indicate that the endpoint is terminating, hence not causing any problems, in this case it leaves the endpoint stuck in the waiting-to-regenerate state, from which it will never recover (as all subsequent regeneration triggers are treated as duplicates).

Dropping the duplicate call to ep.RunMetadataResolver prevents this problem, as it ensures that we don't unnecessarily trigger a second regeneration associated with the API context. The fallback regeneration should actually never be required when operating in Kubernetes mode, because ep.RunMetadataResolver will always cause a new identity to be assigned to the endpoint, in turn triggering a regeneration. A simplified, purely illustrative sketch of this race is included after the footnotes below.

[1]: https://github.com/cilium/cilium/blob/a9bcb0cb3858749f584c48cf5922f060c5f5871e/daemon/cmd/endpoint.go#L569-L582

```go
if !regenTriggered {
	regenMetadata := &regeneration.ExternalRegenerationMetadata{
		Reason:            "Initial build on endpoint creation",
		ParentContext:     ctx,
		RegenerationLevel: regeneration.RegenerateWithDatapath,
	}
	build, err := ep.SetRegenerateStateIfAlive(regenMetadata)
	if err != nil {
		return d.errorDuringCreation(ep, err)
	}
	if build {
		ep.Regenerate(regenMetadata)
	}
}
```

[2]: https://github.com/isovalent/cilium/blob/a9bcb0cb3858749f584c48cf5922f060c5f5871e/daemon/cmd/endpoint.go#L573-L575
[3]: https://github.com/cilium/cilium/blob/a9bcb0cb3858749f584c48cf5922f060c5f5871e/pkg/endpoint/events.go#L64-L71
```go
// We should only queue the request after we use all the endpoint's
// lock/unlock. Otherwise this can get a deadlock if the endpoint is
// being deleted at the same time. More info PR-1777.
doneFunc, err := e.owner.QueueEndpointBuild(regenContext.parentContext, uint64(e.ID))
if err != nil {
	if !errors.Is(err, context.Canceled) {
		e.getLogger().WithError(err).Warning("unable to queue endpoint build")
	}
```

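To make the interleaving above easier to follow, here is a minimal, self-contained Go sketch of the same pattern. Every name in it (endpoint, setRegenerateStateIfAlive, queueBuild, the state constants) is a hypothetical stand-in rather than Cilium's actual code; the sketch only models the state transitions and the canceled-context drop from steps 3-7.

```go
// Hypothetical reproduction of the race described above; not Cilium code.
package main

import (
	"context"
	"fmt"
)

type state string

const (
	ready               state = "ready"
	waitingToRegenerate state = "waiting-to-regenerate"
	regenerating        state = "regenerating"
)

type endpoint struct{ st state }

// setRegenerateStateIfAlive schedules a regeneration; a trigger arriving while
// one is already pending is treated as a duplicate and ignored.
func (e *endpoint) setRegenerateStateIfAlive() bool {
	if e.st == waitingToRegenerate {
		return false
	}
	e.st = waitingToRegenerate
	return true
}

// queueBuild mimics the build queue: a regeneration whose parent context has
// already been canceled is silently dropped (cf. [3]).
func (e *endpoint) queueBuild(ctx context.Context) {
	if ctx.Err() != nil {
		return // dropped: nothing will ever move us out of waiting-to-regenerate
	}
	e.st = regenerating
	e.st = ready
}

func main() {
	ep := &endpoint{st: ready}

	// Step 3: the background resolver schedules regeneration (a), which the
	// queue then starts processing.
	ep.setRegenerateStateIfAlive()
	ep.st = regenerating

	// Steps 4-5: the fallback path schedules regeneration (b), tied to the
	// endpoint-creation API request context.
	apiCtx, cancelAPI := context.WithCancel(context.Background())
	ep.setRegenerateStateIfAlive()

	// Step 6: regeneration (a) completes. It would move the endpoint to ready,
	// but correctly skips that because (b) has been scheduled in the meantime.
	if ep.st == regenerating {
		ep.st = ready
	}
	// The API request now unblocks (first regeneration done), so its context
	// gets canceled.
	cancelAPI()

	// Step 7: (b) is finally processed, but its parent context is already
	// canceled, so the build is dropped and the endpoint stays stuck.
	ep.queueBuild(apiCtx)

	fmt.Println("final state:", ep.st)                                    // waiting-to-regenerate
	fmt.Println("new trigger accepted:", ep.setRegenerateStateIfAlive()) // false: duplicate
}
```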
Fix a bug that could cause newly added endpoints to remain stuck in the waiting-to-regenerate state forever, in turn causing traffic from/to those endpoints to be incorrectly dropped.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. area/daemon Impacts operation of the Cilium daemon. release-note/bug This PR fixes an issue in a previous release of Cilium. affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch affects/v1.16 This issue affects v1.16 branch needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch affects/v1.17 This issue affects v1.17 branch labels Jan 20, 2025
@giorio94
Member Author

/test

@giorio94 giorio94 marked this pull request as ready for review January 20, 2025 13:25
@giorio94 giorio94 requested a review from a team as a code owner January 20, 2025 13:25
@giorio94 giorio94 requested review from squeed and aanm January 20, 2025 13:25
@squeed
Contributor

squeed commented Jan 20, 2025

Thanks for the detailed timeline. It's causing me to think about the case where the local informer is behind, so we fail the UID check. I wonder if this would be similarly affected -- I'm playing through this in my head now.

@giorio94
Member Author

> Thanks for the detailed timeline. It's causing me to think about the case where the local informer is behind, so we fail the UID check. I wonder if this would be similarly affected -- I'm playing through this in my head now.

The main difference in that case is that the retrieval from the store is retried a few times and, eventually, a direct get from the API server is also attempted [1] (a generic sketch of that fallback pattern follows the snippet below). Overall, it seems to me that this bug would be less likely to show up in that case (although still possible), but the fix would address it as well. A few related rough edges of that logic had already been addressed in #36392 (although that one has not been backported).

[1]:

```go
if newPod, err2 := d.clientset.Slim().CoreV1().Pods(ep.K8sNamespace).Get(
	ctx, ep.K8sPodName, metav1.GetOptions{},
); err2 != nil {
```
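
For completeness, here is a rough, generic sketch of that "retry the store, then fall back to a direct API server get" pattern, written against plain client-go rather than Cilium's actual resolver; the function name and retry parameters are made up for illustration.

```go
// Hypothetical helper illustrating the retry-then-direct-get pattern.
package podlookup

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// getPodWithFallback first polls the local informer store a few times (it may
// simply not have received the Pod yet) and, if that keeps failing, issues a
// direct GET against the API server as a last resort.
func getPodWithFallback(ctx context.Context, lister listersv1.PodLister,
	client kubernetes.Interface, namespace, name string) (*corev1.Pod, error) {

	var lastErr error
	for i := 0; i < 5; i++ {
		pod, err := lister.Pods(namespace).Get(name)
		if err == nil {
			return pod, nil
		}
		lastErr = err

		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(200 * time.Millisecond):
		}
	}

	// The informer is still behind: ask the API server directly.
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, fmt.Errorf("retrieving pod %s/%s: %w (store error: %v)",
			namespace, name, err, lastErr)
	}
	return pod, nil
}
```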

Contributor

@squeed squeed left a comment


This makes perfect sense, thanks!

@squeed squeed added this pull request to the merge queue Jan 21, 2025
Merged via the queue into cilium:main with commit 6d90e39 Jan 21, 2025
72 checks passed
@rastislavs rastislavs mentioned this pull request Jan 21, 2025
45 tasks
@rastislavs rastislavs added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 21, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Jan 22, 2025
@rastislavs rastislavs mentioned this pull request Jan 22, 2025
19 tasks
@rastislavs rastislavs added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Jan 22, 2025
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Jan 24, 2025
@antonipp
Contributor

👋 Thank you very much for fixing this! Would it be possible to label it with needs-backport/1.15 as well, so that it gets backported there too?

@giorio94
Member Author

> 👋 Thank you very much for fixing this! Would it be possible to label it with needs-backport/1.15 as well, so that it gets backported there too?

Yes, I think this may fall under the Major bugfixes relevant to the correct operation of Cilium category, and be eligible for backport to v1.15, especially considering that the change is non-invasive. Marking accordingly.

@giorio94 giorio94 mentioned this pull request Jan 27, 2025
8 tasks
@giorio94 giorio94 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 labels Jan 27, 2025
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Jan 29, 2025
dkoshkin added a commit to nutanix-cloud-native/cluster-api-runtime-extensions-nutanix that referenced this pull request Feb 21, 2025
dkoshkin added a commit to nutanix-cloud-native/cluster-api-runtime-extensions-nutanix that referenced this pull request Feb 21, 2025
faiq pushed a commit to nutanix-cloud-native/cluster-api-runtime-extensions-nutanix that referenced this pull request Feb 21, 2025
**What problem does this PR solve?**:
This version fixes an upstream bug
cilium/cilium#37086

**Special notes for your reviewer**:
I branched `release/v0.27.x` from the last release and this branch does
not include:
* #1054
* #1055