
Conversation

giorio94
Member

@giorio94 giorio94 commented Jan 20, 2025

Do not explicitly call RunMetadataResolver in case the pod namespace label has not been retrieved yet (e.g., the pod informer has not yet received the creation event for the pod), given that the same operation is always guaranteed to be started immediately afterwards as well. Moreover, this duplicate call can eventually cause the endpoint regeneration state machine to go out of sync, leaving the endpoint stuck in "waiting-to-regenerate" state forever.

Specifically, the above problem occurs due to the following sequence of events:

  1. A new pod gets created and scheduled onto a node, in turn eventually leading to an endpoint creation API request.
  2. When the request is processed, the pod informer has not yet received the pod creation event, hence causing the pod retrieval to fail.
  3. At this point, the set of identity labels only contains the init one (because we couldn't retrieve the pod labels), hence triggering the first call to ep.RunMetadataResolver. In turn, this starts a controller in the background (not blocking, as the second parameter is false), which attempts to retrieve the labels again, sets the identity (using the init one in case of retrieval failure) and triggers the endpoint regeneration. The endpoint first transitions into the waiting-to-regenerate state; eventually, the regeneration starts to be processed, and the endpoint moves into the regenerating state.
  4. ep.RunMetadataResolver is called a second time; this step would normally be blocking (the second parameter is true), but if the labels are already correct thanks to the previous call to the same function, no regeneration is required, and the call eventually returns regenTriggered = false.
  5. Given that regenTriggered is false, the regeneration is triggered again immediately afterwards [1], transitioning the endpoint into the waiting-to-regenerate state again. It is important to note that here we propagate the endpoint creation context as part of the regeneration event, rather than using the endpoint context. This appears correct, but turns out to be problematic in this specific case.
  6. In the background, the initial regeneration triggered in 3. completes correctly. An attempt is then made to transition the endpoint into the ready state, but that is (correctly) skipped because another regeneration has been scheduled subsequently (in 5.). At the same time, WaitForFirstRegeneration [2] unblocks (given that the regeneration completed), letting the endpoint creation API request terminate. This, in turn, causes the associated context to be canceled.
  7. At this point, the endpoint regeneration triggered in 5. still needs to be executed; however, depending on the timing, the handling may be aborted in [3], given that the context has just been canceled. While this would normally indicate that the endpoint is terminating, hence not causing any problems, in this case it leaves the endpoint stuck in the waiting-to-regenerate state, from which it will never recover (as all subsequent regeneration triggers are treated as duplicates).

Dropping the duplicate call to ep.RunMetadataResolver prevents this problem, as it ensures that we don't unnecessarily trigger a second regeneration associated with the API context. The fallback regeneration should actually never be required when operating in Kubernetes mode, because ep.RunMetadataResolver will always cause a new identity to be assigned to the endpoint, in turn triggering a regeneration. A simplified, purely illustrative sketch of this race is included after the footnotes below.

[1]: https://github.com/cilium/cilium/blob/a9bcb0cb3858749f584c48cf5922f060c5f5871e/daemon/cmd/endpoint.go#L569-L582

```go
if !regenTriggered {
	regenMetadata := &regeneration.ExternalRegenerationMetadata{
		Reason:            "Initial build on endpoint creation",
		ParentContext:     ctx,
		RegenerationLevel: regeneration.RegenerateWithDatapath,
	}
	build, err := ep.SetRegenerateStateIfAlive(regenMetadata)
	if err != nil {
		return d.errorDuringCreation(ep, err)
	}
	if build {
		ep.Regenerate(regenMetadata)
	}
}
```

[2]: https://github.com/isovalent/cilium/blob/a9bcb0cb3858749f584c48cf5922f060c5f5871e/daemon/cmd/endpoint.go#L573-L575
[3]: https://github.com/cilium/cilium/blob/a9bcb0cb3858749f584c48cf5922f060c5f5871e/pkg/endpoint/events.go#L64-L71
```go
// We should only queue the request after we use all the endpoint's
// lock/unlock. Otherwise this can get a deadlock if the endpoint is
// being deleted at the same time. More info PR-1777.
doneFunc, err := e.owner.QueueEndpointBuild(regenContext.parentContext, uint64(e.ID))
if err != nil {
	if !errors.Is(err, context.Canceled) {
		e.getLogger().WithError(err).Warning("unable to queue endpoint build")
	}
```

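To make the interleaving above easier to follow, here is a minimal, self-contained Go sketch of the same pattern. Every name in it (endpoint, setRegenerateStateIfAlive, queueBuild, the state constants) is a hypothetical stand-in rather than Cilium's actual code; the sketch only models the state transitions and the canceled-context drop from steps 3-7.

```go
// Hypothetical reproduction of the race described above; not Cilium code.
package main

import (
	"context"
	"fmt"
)

type state string

const (
	ready               state = "ready"
	waitingToRegenerate state = "waiting-to-regenerate"
	regenerating        state = "regenerating"
)

type endpoint struct{ st state }

// setRegenerateStateIfAlive schedules a regeneration; a trigger arriving while
// one is already pending is treated as a duplicate and ignored.
func (e *endpoint) setRegenerateStateIfAlive() bool {
	if e.st == waitingToRegenerate {
		return false
	}
	e.st = waitingToRegenerate
	return true
}

// queueBuild mimics the build queue: a regeneration whose parent context has
// already been canceled is silently dropped (cf. [3]).
func (e *endpoint) queueBuild(ctx context.Context) {
	if ctx.Err() != nil {
		return // dropped: nothing will ever move us out of waiting-to-regenerate
	}
	e.st = regenerating
	e.st = ready
}

func main() {
	ep := &endpoint{st: ready}

	// Step 3: the background resolver schedules regeneration (a), which the
	// queue then starts processing.
	ep.setRegenerateStateIfAlive()
	ep.st = regenerating

	// Steps 4-5: the fallback path schedules regeneration (b), tied to the
	// endpoint-creation API request context.
	apiCtx, cancelAPI := context.WithCancel(context.Background())
	ep.setRegenerateStateIfAlive()

	// Step 6: regeneration (a) completes. It would move the endpoint to ready,
	// but correctly skips that because (b) has been scheduled in the meantime.
	if ep.st == regenerating {
		ep.st = ready
	}
	// The API request now unblocks (first regeneration done), so its context
	// gets canceled.
	cancelAPI()

	// Step 7: (b) is finally processed, but its parent context is already
	// canceled, so the build is dropped and the endpoint stays stuck.
	ep.queueBuild(apiCtx)

	fmt.Println("final state:", ep.st)                                    // waiting-to-regenerate
	fmt.Println("new trigger accepted:", ep.setRegenerateStateIfAlive()) // false: duplicate
}
```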
Fix a bug that could cause newly added endpoints to remain stuck in the waiting-to-regenerate state forever, in turn causing traffic from/to those endpoints to be incorrectly dropped.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. area/daemon Impacts operation of the Cilium daemon. release-note/bug This PR fixes an issue in a previous release of Cilium. affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch affects/v1.16 This issue affects v1.16 branch needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch affects/v1.17 This issue affects v1.17 branch labels Jan 20, 2025
@giorio94
Member Author

/test

@giorio94 giorio94 marked this pull request as ready for review January 20, 2025 13:25
@giorio94 giorio94 requested a review from a team as a code owner January 20, 2025 13:25
@giorio94 giorio94 requested review from squeed and aanm January 20, 2025 13:25
@squeed
Contributor

squeed commented Jan 20, 2025

Thanks for the detailed timeline. It's causing me to think about the case where the local informer is behind, so we fail the UID check. I wonder if this would be similarly affected -- I'm playing through this in my head now.

@giorio94
Member Author

> Thanks for the detailed timeline. It's causing me to think about the case where the local informer is behind, so we fail the UID check. I wonder if this would be similarly affected -- I'm playing through this in my head now.

The main difference in that case is that the retrieval from the store is retried a few times and, eventually, a direct get from the API server is also attempted [1] (a generic sketch of that fallback pattern follows the snippet below). Overall, it seems to me that this bug would be less likely to show up in that case (although still possible), but the fix would address it as well. A few related rough edges of that logic had already been addressed in #36392 (although that one has not been backported).

[1]:

```go
if newPod, err2 := d.clientset.Slim().CoreV1().Pods(ep.K8sNamespace).Get(
	ctx, ep.K8sPodName, metav1.GetOptions{},
); err2 != nil {
```
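
For completeness, here is a rough, generic sketch of that "retry the store, then fall back to a direct API server get" pattern, written against plain client-go rather than Cilium's actual resolver; the function name and retry parameters are made up for illustration.

```go
// Hypothetical helper illustrating the retry-then-direct-get pattern.
package podlookup

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// getPodWithFallback first polls the local informer store a few times (it may
// simply not have received the Pod yet) and, if that keeps failing, issues a
// direct GET against the API server as a last resort.
func getPodWithFallback(ctx context.Context, lister listersv1.PodLister,
	client kubernetes.Interface, namespace, name string) (*corev1.Pod, error) {

	var lastErr error
	for i := 0; i < 5; i++ {
		pod, err := lister.Pods(namespace).Get(name)
		if err == nil {
			return pod, nil
		}
		lastErr = err

		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(200 * time.Millisecond):
		}
	}

	// The informer is still behind: ask the API server directly.
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, fmt.Errorf("retrieving pod %s/%s: %w (store error: %v)",
			namespace, name, err, lastErr)
	}
	return pod, nil
}
```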

Contributor

@squeed squeed left a comment


This makes perfect sense, thanks!

@squeed squeed added this pull request to the merge queue Jan 21, 2025
Merged via the queue into cilium:main with commit 6d90e39 Jan 21, 2025
72 checks passed
@rastislavs rastislavs mentioned this pull request Jan 21, 2025
45 tasks
@rastislavs rastislavs added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 21, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Jan 22, 2025
@rastislavs rastislavs mentioned this pull request Jan 22, 2025
19 tasks
@rastislavs rastislavs added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Jan 22, 2025
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Jan 24, 2025
@antonipp
Contributor

👋 Thank you very much for fixing this! Would it be possible to label it with needs-backport/1.15 as well, so that it gets backported there too?

@giorio94
Member Author

> 👋 Thank you very much for fixing this! Would it be possible to label it with needs-backport/1.15 as well, so that it gets backported there too?

Yes, I think this may fall under the Major bugfixes relevant to the correct operation of Cilium category, and be eligible for backport to v1.15, especially considering that the change is non-invasive. Marking accordingly.

@giorio94 giorio94 mentioned this pull request Jan 27, 2025
8 tasks
@giorio94 giorio94 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 labels Jan 27, 2025
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Jan 29, 2025
dkoshkin added a commit to nutanix-cloud-native/cluster-api-runtime-extensions-nutanix that referenced this pull request Feb 21, 2025
dkoshkin added a commit to nutanix-cloud-native/cluster-api-runtime-extensions-nutanix that referenced this pull request Feb 21, 2025
faiq pushed a commit to nutanix-cloud-native/cluster-api-runtime-extensions-nutanix that referenced this pull request Feb 21, 2025
**What problem does this PR solve?**:
This version fixes an upstream bug
cilium/cilium#37086

**Special notes for your reviewer**:
I branched `release/v0.27.x` from the last release and this branch does
not include:
* #1054
* #1055