
Conversation

@jrife (Contributor) commented Jun 28, 2025

Transient errors inside regenerateBPF can lead to an endpoint's policy map either not being initialized or being accidentally cleared out, resulting in broken connectivity.

Scenario

In this scenario, a transient error causes the first call to regenerateBPF to fail for an endpoint. The second regeneration attempt succeeds, but the final policy map state is empty.

initial state:
    e.desiredPolicy == e.realizedPolicy == <empty EndpointPolicy>

0:  e.ComputeInitialPolicy(...)
1:      e.regeneratePolicy(...)
2:          result := &policyGenerateResult{
                endpointPolicy:   e.desiredPolicy,
                identityRevision: e.identityRevision,
            }
3:          selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...)
            ...
4:          result.endpointPolicy = selectorPolicy.DistillPolicy(...)
            ...
5:      e.setDesiredPolicy(...)
6:          e.setNextPolicyRevision(res.policyRevision)
7:              e.nextPolicyRevision = revision
8:          if ... && res.endpointPolicy != e.desiredPolicy {
                ...
9:              e.desiredPolicy = res.endpointPolicy
10:             datapathRegenCtx.revertStack.Push(func() error {
                    ...
                    e.desiredPolicy = e.realizedPolicy
                    ...
                    _, _, err = e.syncPolicyMapWith(currentMap, false);
                    ...
                })
            }
12:         res.endpointPolicy = nil

current state:
    e.desiredPolicy = <computed policy>
    e.realizedPolicy = <empty policy>

13: e.regenerate(...)
14:     revision, err := e.regenerateBPF(...)
            e.runPreCompilationSteps(regenContext)
                e.regeneratePolicy(...)
                    ...
                    selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...)
                    <no-op, since selectorPolicy == nil>
                e.setDesiredPolicy(...) <no-op - e.desiredPolicy matches
                                         the computed policy from above>
15:         defer func() {
                if !e.isProperty(PropertyFakeEndpoint) {
                    e.finalizeProxyState(regenContext, reterr)
                }
            }()
            ...
16:         err = e.realizeBPFState(regenContext)
17:         if err != nil {
                <e.realizeBPFState returns an error>
                ...
            }
18:         e.finalizeProxyState(regenContext, reterr) <run deferred
                                                        function>
19:             datapathRegenCtx.revertStack.Revert() <run pushed
                                                       revertStack
                                                       function from 10>
20:                 e.desiredPolicy = e.realizedPolicy <from 10>

current state:
    e.desiredPolicy == e.realizedPolicy == <empty policy>

<reattempt regeneration>
22: e.regenerate(...)
23:     revision, err := e.regenerateBPF(...)
            e.runPreCompilationSteps(regenContext)
                e.regeneratePolicy(...)
                    ...
                    selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...)
                    <no-op, since selectorPolicy == nil>
                e.setDesiredPolicy(...) <no-op - e.desiredPolicy still
                                         points to e.realizedPolicy,
                                         which is empty>
            ...

current state:
    e.desiredPolicy == e.realizedPolicy == <empty policy>

25:         err = e.realizeBPFState(regenContext) <e.realizeBPFState
                                                   succeeds>
            ...
        ...
26:     return e.updateRealizedState(stats, origDir, revision) <
            ...
            e.realizedPolicy = e.desiredPolicy <"locks in" the empty
                                                map>
            ...

current state:
    e.desiredPolicy == e.realizedPolicy == <empty policy>

<regeneration is done>

Explanation

The second regeneration attempt does not recompute the endpoint policy: e.nextPolicyRevision was already advanced during the first attempt, so e.desiredPolicy is left unchanged. The call succeeds, and the empty map state is "locked in" until some future recomputation.
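
Schematically, the skip boils down to a revision comparison. Here is a minimal, stand-alone illustration (the real check lives inside GetSelectorPolicy and is quoted verbatim later in this thread; skipRevision stands in for the endpoint's nextPolicyRevision):

package main

import "fmt"

// getSelectorPolicy mimics the skip logic: if the endpoint already has the
// repository's revision, no selector policy is returned and nothing is
// recomputed.
func getSelectorPolicy(skipRevision, repoRevision uint64) (string, uint64) {
	if skipRevision >= repoRevision {
		return "", repoRevision // nil selectorPolicy in the real code
	}
	return "<recomputed policy>", repoRevision
}

func main() {
	// The failed first attempt already advanced nextPolicyRevision to
	// match the repository revision, so the retry skips recomputation.
	p, _ := getSelectorPolicy(2, 2)
	fmt.Printf("selectorPolicy=%q (empty => recomputation skipped)\n", p)
}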

The Fix

To remedy this, revert e.nextPolicyRevision in the revert stack to ensure that e.desiredPolicy is recomputed on the next regeneration attempt; see the sketch below.
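
A minimal, self-contained sketch of the idea; endpoint and revertStack here are simplified stand-ins, not Cilium's real types:

package main

import "fmt"

type endpoint struct {
	nextPolicyRevision uint64
}

type revertStack []func() error

func (s *revertStack) Push(f func() error) { *s = append(*s, f) }

// Revert runs the pushed functions in reverse order, mirroring how the
// datapath regeneration context unwinds on failure.
func (s *revertStack) Revert() error {
	for i := len(*s) - 1; i >= 0; i-- {
		if err := (*s)[i](); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	e := &endpoint{nextPolicyRevision: 1}
	var stack revertStack

	// During the attempt, capture the old revision before advancing it
	// (as setNextPolicyRevision does in step 6 of the timeline).
	prev := e.nextPolicyRevision
	e.nextPolicyRevision = 2
	stack.Push(func() error {
		// The fix: roll the revision back so the next attempt does
		// not skip policy recomputation.
		e.nextPolicyRevision = prev
		return nil
	})

	// The attempt fails and is reverted.
	_ = stack.Revert()
	fmt.Println("nextPolicyRevision:", e.nextPolicyRevision) // 1 again
}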

Make Sure e.realizedPolicy Reflects Reality

Note that the revertStack function above relies on the assumption that e.realizedPolicy reflects the current endpoint policy map state. However, if the agent has just started, e.realizedPolicy may be empty even for existing endpoints. Copy the state from the policyMapDump to e.realizedPolicy inside e.runPreCompilationSteps to make sure the restore logic actually works in these situations instead of clearing out the existing map.
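
As a rough illustration of the backfill, using hypothetical stand-in types (mapEntry, policyState) rather than the real policy map structures:

package sketch

// mapEntry is an illustrative stand-in for a dumped policy map entry.
type mapEntry struct {
	key   uint64
	value uint64
}

// policyState is an illustrative stand-in for realized policy state.
type policyState struct {
	entries map[uint64]uint64
}

// seedRealizedFromDump copies what is actually installed in the datapath
// into the in-memory realized state, so that a later revert restores the
// pre-existing entries instead of clearing the map.
func seedRealizedFromDump(realized *policyState, dump []mapEntry) {
	if realized.entries == nil {
		realized.entries = make(map[uint64]uint64, len(dump))
	}
	for _, e := range dump {
		realized.entries[e.key] = e.value
	}
}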

Testing

I tested this manually using the repro described in #38998. Before applying this patch, the repro resulted in broken connectivity to the coredns pods every time I started up my kind cluster. After applying this patch, they start up reliably and all liveness/readiness probes succeed.

Fixes: #38998

policy: Fix a bug where transient errors in endpoint regeneration lead to broken connectivity.

@maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Jun 28, 2025
@github-actions bot added the sig/policy label (Impacts whether traffic is allowed or denied based on user-defined policies.) on Jun 28, 2025
@jrife (Contributor, Author) commented Jun 28, 2025

/test

@jrife (Contributor, Author) commented Jun 29, 2025

/test

@jrife (Contributor, Author) commented Jun 29, 2025

/test

@jrife jrife marked this pull request as ready for review June 29, 2025 21:29
@jrife jrife requested review from a team as code owners June 29, 2025 21:29
@jrife jrife requested review from derailed and tommyp1ckles June 29, 2025 21:29
@jrife (Contributor, Author) commented Jun 29, 2025

@tsotne95, since you also reproduced #38998, could you retest in your environment with this patch as well? It would be useful to repro with your methodology and see if this patch resolves the issues you saw. It's possible there are some other weird issues resulting from transient errors in the regeneration path.

@tsotne95 (Contributor) commented Jul 1, 2025

I commented on the issue.

@jrife (Contributor, Author) commented Jul 7, 2025

@tommyp1ckles @derailed Could you take a look at this when you get a chance? This fix addresses an issue which is currently a 1.18 release blocker, though the issue actually exists in 1.17 as well.

@derailed (Contributor) left a comment

@jrife Nice work!

@squeed squeed self-requested a review July 9, 2025 15:26
@anubhabMajumdar (Contributor) left a comment

Thanks for the fix.

@squeed (Contributor) commented Jul 10, 2025

Thanks for the excellent PR description. I'm reading through it, to make sure I understand all the implications. I haven't dropped it on the floor, I promise :-)

@squeed (Contributor) commented Jul 10, 2025

Rather than trying to back-fill e.realizedPolicy(), what about setting e.forcePolicyCompute = true when reverting policy? Would that have the desired effect?

@jrife (Contributor, Author) commented Jul 14, 2025

@squeed Thanks for taking a look!

Rather than trying to back-fill e.realizedPolicy(), what about setting e.forcePolicyCompute = true when reverting policy? Would that have the desired effect?

Isn't the intent of the revert to go back to a map state that was "good" instead of leaving it in some weird intermediate state? If so, I'm not sure that e.forcePolicyCompute really accomplishes the same thing. Actually, if we're referring to this block:

defer func(p *policy.EndpointPolicy) {
	e.desiredPolicy = p
}(e.desiredPolicy)

I didn't want to force policy recomputation for fear that it would be unnecessarily expensive if we get stuck in some loop where this condition keeps occurring. After all, e.desiredPolicy should be perfectly fine and there's no real reason to recompute it on the next go around; it's just the BPF map state that needs to be synchronized.
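
To illustrate the mechanism, here is a tiny self-contained sketch of how the defer keeps e.desiredPolicy intact across the temporary swap (stand-in types; the map sync itself is elided):

package main

import "fmt"

type policyRef struct{ name string }

type ep struct {
	desiredPolicy  *policyRef
	realizedPolicy *policyRef
}

// revertWithSwap temporarily points desiredPolicy at realizedPolicy so a
// map sync (elided here) can rebuild the last-good state, then restores
// the original desiredPolicy via the deferred assignment.
func (e *ep) revertWithSwap() {
	defer func(p *policyRef) {
		e.desiredPolicy = p
	}(e.desiredPolicy)

	e.desiredPolicy = e.realizedPolicy
	// ... e.syncPolicyMapWith(currentMap, false) would run here ...
}

func main() {
	e := &ep{
		desiredPolicy:  &policyRef{name: "computed"},
		realizedPolicy: &policyRef{name: "empty"},
	}
	e.revertWithSwap()
	fmt.Println("desiredPolicy after revert:", e.desiredPolicy.name) // computed
}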

@squeed (Contributor) commented Jul 15, 2025

Isn't the intent of the revert to go back to a map state that was "good"

@jrife yah, fair enough. Some of the code special-cases e.realizedPolicy.Empty() == true for handling the before-first-regeneration state. But, clearly, not consistently. Plus, there's no guarantee that e.realizedPolicy.Empty() won't be true in the future (e.g. policy disabled).

That said, I'm not super concerned about a few extra policy computations. I'd rather do correctness first, then skip calculations when we understand all possible cases. This bug is a clear example of that :-). In particular, the conditions in runPreCompilationSteps() deciding whether or not to regenerate policy are probably incorrect.

@squeed (Contributor) commented Jul 17, 2025

The fact that this code touches e.desiredPolicy at all seems strange to me and seems driven by the fact that syncPolicyMapWith does its synchronization against e.desiredPolicy.

These are great questions. Which has me wondering: do we even need e.desiredPolicy? Obviously that would be a refactor, not a bugfix, but it would make the state management so much clearer.

@squeed (Contributor) commented Jul 17, 2025

One more question. In your (excellent) timeline, there are the steps

<reattempt regeneration>
22: e.regenerate(...)
23:     revision, err := e.regenerateBPF(...)
            e.runPreCompilationSteps(regenContext)
                e.regeneratePolicy(...)
                    ...
                    selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...)
                    <no-op, since selectorPolicy == nil>

why is selectorPolicy nil here?

@jrife (Contributor, Author) commented Jul 21, 2025

why is selectorPolicy nil here?

I believe I was referring to this condition:

// Do we already have a given revision?
// If so, skip calculation.
if skipRevision >= rev {
	return nil, rev, nil
}

GetSelectorPolicy returns nil in the timeline, since we already advanced e.nextPolicyRevision in the first attempt.

@jrife (Contributor, Author) commented Jul 23, 2025

@jrife did you have a short writeup of the CI triage you performed? It'd be helpful to understand which issues the PR is facing.

It depends on what you mean by writeup. I tried to link the CI failures I saw with some known issues above. The latest failure with ci-clustermesh looks the same as the failure I saw earlier in ci-eks:

check-log-errors/no-errors-in-logs:pkg/endpoint:kind-cluster2-16478452662/kube-system/cilium-r5jtl (cilium-agent): Found 1 logs in kind-cluster2-16478452662/kube-system/cilium-r5jtl (cilium-agent) matching list of errors that must be investigated:
time=2025-07-23T18:13:22.859972997Z level=warn source=/go/src/github.com/cilium/cilium/pkg/endpoint/endpoint.go:1785 msg="Unable to fetch kubernetes labels" desiredPolicyRevision=1 ciliumEndpointName=kube-system/coredns-f89dfbcc7-sp4r7 endpointID=3762 containerID=182d4fe81b containerInterface="" datapathPolicyRevision=1 ipv6=fd00:10:244:1::695b identity=5 k8sPodName=kube-system/coredns-f89dfbcc7-sp4r7 ipv4=10.244.1.31 subsys=resolve-labels error="pod.core \"coredns-f89dfbcc7-sp4r7\" not found" (30 occurrences)

There's a warning coming out of this block at some point in the logs:

if restoredEndpoint && k8sErrors.IsNotFound(err) {
	e.Logger(resolveLabels).Info(
		"Unable to resolve metadata during endpoint restoration. Is the pod still running?",
		logfields.Error, err,
	)
} else {
	e.Logger(resolveLabels).Warn(
		"Unable to fetch kubernetes labels",
		logfields.Error, err,
	)
}

I see this, which seems like the same thing: #40682

@joestringer (Member) commented Jul 23, 2025

@jrife Ah sorry I think I just didn't parse the comments very well. When I saw all the re-run comments and looked earlier in the PR, I just didn't identify the information from the posts above. The most obvious format to understand this would be a comment in the following form:

  • ci-eks run <here> hit <issue>
  • clustermesh run <here> hit <issue>
  • gateway-api run <here> hit <issue>
  • ipsec-e2e run <here> hit <issue>

Now that we've re-run I can see that the same tests are failing again. At least as of yesterday, I was aware that the clustermesh test was somewhat unreliable, but I wasn't aware of the failures for these other branches.

@joestringer (Member):

Looking over the failures again, I'm wondering if the tree further destabilized over the past day or two. At a glance the failures aren't obviously related to the PR, but that's often the case whether they are related or not; that's the downside with relying so much on e2e tests 😢.

Here's the general trend from the CI dashboard for the main branch over the past week, but the failure rate only seems to increase in proportion to the number of tests run:

[screenshot: CI dashboard failure trend for the main branch over the past week]

@joestringer (Member):

In this screenshot I've filtered the main branch for the last week, showing only scheduled/regular builds. The top graph shows a fairly consistent green bar (total runs) and no obvious trend in the orange bars (total failures). The trailing end on the right-hand side is just the latest 6h window, for which runs are not yet triggered or not yet scraped.

[screenshot: CI dashboard, main branch, last week, scheduled/regular builds only]

@joestringer (Member):

@jrife as for your most recent report, I agree those look similar and given Marcel's update / investigation in the thread, that particular failure case was likely introduced into the tree recently.

@jrife (Contributor, Author) commented Jul 23, 2025

The most obvious format to understand this would be a comment in the following form:

@joestringer Sure, I can organize things a bit more in a single comment. I'll go back through the failures and link to those that match existing CI flake issues. I think all of them look like reported issues, but I'll admit that I can't be 100% sure that this change isn't related. I don't see any obvious reasons why this PR should increase the failure rate, but I have been surprised by seemingly benign changes before :).

@jrife (Contributor, Author) commented Jul 23, 2025

@joestringer I wonder if it would be worth opening another PR at the same base commit without the changes in this PR just to get a better baseline?

@joestringer (Member):

@jrife I was just thinking about the same thing. CI will skip some jobs if it seems like the PR doesn't touch those tests, so based on .github/ariane-config.yaml we'd probably want to add some line to a code file; even a comment should do the trick.

I was just glancing through some of the other recent PRs on main also, but that's not providing much signal. ci-clustermesh is definitely unreliable due to #39764. You pointed out the other labels one which Marcel was investigating, so it seems likely that was introduced recently. Not sure about the others.

@jrife (Contributor, Author) commented Jul 23, 2025

@joestringer I was thinking of just commenting out the existing code, so I imagine the same checks would run? Not sure how smart the test filtering is...

@joestringer (Member):

ci-clustermesh
Run hit #40682

This one looks like #39370 instead (error="retrieving device lxc2772fac915a8: Link not found"). The other assessments look accurate though yeah.

@jrife (Contributor, Author) commented Jul 23, 2025

@joestringer Looks like both of those are present in that run. I updated the comment to list them both. FWIW I also saw these both in the ci-clustermesh run over in my dummy PR.

@joestringer (Member) commented Jul 23, 2025

@jrife I don't see "subsys=resolve-labels" in that run (the run linked in my last post). Are you sure?

@jrife (Contributor, Author) commented Jul 23, 2025

@jrife I don't see "subsys=resolve-labels" in that run (the run linked in my last post). Are you sure?

You're right. I just saw "not found" there and thought it was the same.

time=2025-07-23T18:22:21.072492528Z level=error source=/go/src/github.com/cilium/cilium/pkg/endpointmanager/endpointsynchronizer.go:309 msg="Cannot update CEP" containerInterface=eth0 identity=33752 k8sPodName=kube-system/clustermesh-apiserver-6f5c4d98cb-kqc4r endpointID=2551 ciliumEndpointName=kube-system/clustermesh-apiserver-6f5c4d98cb-kqc4r datapathPolicyRevision=0 ipv6=fd00:10:242:2::e996 desiredPolicyRevision=2 ipv4=10.242.2.126 containerID=23c2fbbd38 subsys=endpointsynchronizer controller="sync-to-k8s-ciliumendpoint (2551)" error="ciliumendpoints.cilium.io "clustermesh-apiserver-6f5c4d98cb-kqc4r" not found"

It's coming from a different place in the code than #40682.

Transient errors inside regenerateBPF can lead to an endpoint's policy
map either not being initialized or being accidentally cleared out,
resulting in broken connectivity.

Scenario
========
In this scenario, a transient error causes the first call to
regenerateBPF to fail for an endpoint. The second regeneration attempt
succeeds, but the final policy map state is empty.

initial state:
    e.desiredPolicy == e.realizedPolicy == <empty EndpointPolicy>

0:  e.ComputeInitialPolicy(...)
1:      e.regeneratePolicy(...)
2:          result := &policyGenerateResult{
                endpointPolicy:   e.desiredPolicy,
                identityRevision: e.identityRevision,
            }
3:          selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...)
            ...
4:          result.endpointPolicy = selectorPolicy.DistillPolicy(...)
            ...
5:      e.setDesiredPolicy(...)
6:          e.setNextPolicyRevision(res.policyRevision)
7:              e.nextPolicyRevision = revision
8:          if ... && res.endpointPolicy != e.desiredPolicy {
                ...
9:              e.desiredPolicy = res.endpointPolicy
10:             datapathRegenCtx.revertStack.Push(func() error {
                    ...
                    e.desiredPolicy = e.realizedPolicy
                    ...
                    _, _, err = e.syncPolicyMapWith(currentMap, false);
                    ...
                })
            }
12:         res.endpointPolicy = nil

current state:
    e.desiredPolicy = <computed policy>
    e.realizedPolicy = <empty policy>

13: e.regenerate(...)
14:     revision, err := e.regenerateBPF(...)
            e.runPreCompilationSteps(regenContext)
                e.regeneratePolicy(...)
                    ...
                    selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...)
                    <no-op, since selectorPolicy == nil>
                e.setDesiredPolicy(...) <no-op - e.desiredPolicy matches
                                         the computed policy from above>
15:         defer func() {
                if !e.isProperty(PropertyFakeEndpoint) {
                    e.finalizeProxyState(regenContext, reterr)
                }
            }()
            ...
16:         err = e.realizeBPFState(regenContext)
17:         if err != nil {
                <e.realizeBPFState returns an error>
                ...
            }
18:         e.finalizeProxyState(regenContext, reterr) <run deferred
                                                        function>
19:             datapathRegenCtx.revertStack.Revert() <run pushed
                                                       revertStack
                                                       function from 10>
20:                 e.desiredPolicy = e.realizedPolicy <from 10>

current state:
    e.desiredPolicy == e.realizedPolicy == <empty policy>

<reattempt regeneration>
22: e.regenerate(...)
23:     revision, err := e.regenerateBPF(...)
            e.runPreCompilationSteps(regenContext)
                e.regeneratePolicy(...)
                    ...
                    selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...)
                    <no-op, since selectorPolicy == nil>
                e.setDesiredPolicy(...) <no-op - e.desiredPolicy still
                                         points to e.realizedPolicy,
                                         which is empty>
            ...

current state:
    e.desiredPolicy == e.realizedPolicy == <empty policy>

25:         err = e.realizeBPFState(regenContext) <e.realizeBPFState
                                                   succeeds>
            ...
        ...
26:     return e.updateRealizedState(stats, origDir, revision) <
            ...
            e.realizedPolicy = e.desiredPolicy <"locks in" the empty
                                                map>
            ...

current state:
    e.desiredPolicy == e.realizedPolicy == <empty policy>

<regeneration is done>

Explanation
===========
The second regeneration attempt does not recompute the endpoint policy,
since we update e.nextPolicyRevision in the first attempt, and hence,
e.desiredPolicy is not changed. The call succeeds, then the empty map
state is "locked in" until a recomputation happens in the future.

The Fix
=======
To remedy this, restore the original value of e.desiredPolicy inside the
revertStack function (10 in the sequence above) after it has a chance to
restore the original policy map state:

    ...
    defer func(p *policy.EndpointPolicy) {
        e.desiredPolicy = p
    }(e.desiredPolicy)
    e.desiredPolicy = e.realizedPolicy
    ...

The rationale here is that we still want e.desiredPolicy to reflect our
desired state of the world. We assign it temporarily to e.realizedPolicy
so that e.syncPolicyMapWith works, but we don't want it to stay that way
for the next regeneration attempt, since it may or may not recompute
the desired endpoint policy.

Make Sure e.realizedPolicy Reflects Reality
===========================================
Note that the revertStack function above relies on the assumption that
e.realizedPolicy reflects the current endpoint policy map state.
However, if the agent has just started, e.realizedPolicy may be empty
even for existing endpoints. Copy the state from the policyMapDump to
e.realizedPolicy inside e.runPreCompilationSteps to make sure the
restore logic actually works in these situations instead of clearing out
the existing map.

Fixes: cilium#38998

Signed-off-by: Jordan Rife <jrife@google.com>
@jrife (Contributor, Author) commented Jul 24, 2025

/test

@joestringer (Member):

@jrife thanks for helping to dig into the root causes of those issues affecting the tree! Merging now.

Merged via the queue into cilium:main with commit f052828 Jul 24, 2025
68 checks passed
@joestringer (Member) commented Jul 24, 2025

@jrife Oh, what happened between commit e03ca9e and the final version committed here, f052828? I thought the intent was to move away from the defer approach. Was that intentional?

@jrife (Contributor, Author) commented Jul 24, 2025

@jrife Oh, what happened between commit e03ca9e and the final version committed here, f052828? I thought the intent was to move away from the defer approach. Was that intentional?

Nope, mistake on my part. I did the rebase earlier against an older version of this branch -_-. I'll go ahead and create a new PR.
