endpoint, policy: Don't accidentally clear out endpoint policy maps #40255
Conversation
/test
/test
/test
@tsotne95, since you also reproduced #38998, could you retest in your environment with this patch as well? It would be useful to repro with your methodology and see if this patch resolves the issues you saw. It's possible there are some other weird issues resulting from transient errors in the regeneration path.
@tommyp1ckles @derailed Could you take a look at this when you get a chance? This fix addresses an issue which is currently a 1.18 release blocker, though the issue actually exists in 1.17 as well.
@jrife Nice work!
Thanks for the fix.
Thanks for the excellent PR description. I'm reading through it, to make sure I understand all the implications. I haven't dropped it on the floor, I promise :-)
Rather than trying to back-fill
@squeed Thanks for taking a look!
Isn't the intent of the revert to go back to a map state that was "good" instead of leaving it in some weird intermediate state? If so, I'm not sure that
I didn't want to force policy recomputation for fear that it would be unnecessarily expensive if we get stuck in some loop where this condition keeps occurring. After all,
@jrife yah, fair enough. Some of the code special-cases That said, I'm not super concerned about a few extra policy computations. I'd rather do correctness first, then skip calculations when we understand all possible cases. This bug is a clear example of that :-). In particular, the conditions in
These are great questions. Which has me wondering: do we even need
One more question. In your (excellent) timeline, there are the steps
why is
I believe I was referring to this condition: cilium/pkg/policy/repository.go, lines 496 to 500 (at aedded1).
It depends on what you mean by writeup. I tried to link the CI failures I saw with some known issues above. The latest failure with
There's a warning coming out of this block at some point in the logs: cilium/pkg/endpoint/endpoint.go, lines 1779 to 1789 (at 1222c37).
I see this, which seems like the same thing: #40682
@jrife Ah sorry I think I just didn't parse the comments very well. When I saw all the re-run comments and looked earlier in the PR, I just didn't identify the information from the posts above. The most obvious format to understand this would be a comment in the following form:
Now that we've re-run, I can see that the same tests are failing again. At least as of yesterday, I was aware that the clustermesh test was somewhat unreliable, but I wasn't aware of the failures for these other branches.
@jrife as for your most recent report, I agree those look similar, and given Marcel's update / investigation in the thread, that particular failure case was likely introduced into the tree recently.
@joestringer Sure, I can organize things a bit more in a single comment. I'll go back through the failures and link to those that match existing CI flake issues. I think all of them look like reported issues, but I'll admit that I can't be 100% sure that this change isn't related. I don't see any obvious reasons why this PR should increase the failure rate, but I have been surprised by seemingly benign changes before :).
@joestringer I wonder if it would be worth opening another PR at the same base commit without the changes in this PR just to get a better baseline?
@jrife I was just thinking about the same thing. CI will skip some jobs if it seems like the PR doesn't touch those tests, so based on the I was just glancing through some of the other recent PRs on main also, but that's not providing much signal. ci-clustermesh is definitely unreliable due to #39764. You pointed out the other labels one which Marcel was investigating, so it seems likely that was introduced recently. Not sure about the others.
@joestringer I was thinking of just commenting out the existing code, so I imagine the same checks would run? Not sure how smart the test filtering is...
Just summarizing the latest batch of failures. Looks like a lot are due to #40682.
Failure Summary
@joestringer Looks like both of those are present in that run. I updated the comment to list them both. FWIW I also saw these both in the
@jrife I don't see "subsys=resolve-labels" in
You're right. I just saw "not found" there and thought it was the same.
It's coming from a different place in the code than #40682.
Transient errors inside regenerateBPF can lead to an endpoint's policy map either not being initialized or being accidentally cleared out, resulting in broken connectivity.

Scenario
========

In this scenario, a transient error causes the first call to regenerateBPF to fail for an endpoint. The second regeneration attempt succeeds, but the final policy map state is empty.

initial state: e.desiredPolicy == e.realizedPolicy == <empty EndpointPolicy>

 0: e.ComputeInitialPolicy(...)
 1:   e.regeneratePolicy(...)
 2:     result := &policyGenerateResult{
          endpointPolicy:   e.desiredPolicy,
          identityRevision: e.identityRevision,
        }
 3:     selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...)
        ...
 4:     result.endpointPolicy = selectorPolicy.DistillPolicy(...)
        ...
 5:   e.setDesiredPolicy(...)
 6:     e.setNextPolicyRevision(res.policyRevision)
 7:       e.nextPolicyRevision = revision
 8:     if ... && res.endpointPolicy != e.desiredPolicy {
          ...
 9:       e.desiredPolicy = res.endpointPolicy
10:       datapathRegenCtx.revertStack.Push(func() error {
            ...
            e.desiredPolicy = e.realizedPolicy
            ...
            _, _, err = e.syncPolicyMapWith(currentMap, false);
            ...
          })
        }
12:     res.endpointPolicy = nil

current state: e.desiredPolicy  = <computed policy>
               e.realizedPolicy = <empty policy>

13: e.regenerate(...)
14:   revision, err := e.regenerateBPF(...)
        e.runPreCompilationSteps(regenContext)
          e.regeneratePolicy(...)
            ...
            selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...) <no-op, since selectorPolicy == nil>
          e.setDesiredPolicy(...) <no-op - e.desiredPolicy matches the computed policy from above>
15:     defer func() {
          if !e.isProperty(PropertyFakeEndpoint) {
            e.finalizeProxyState(regenContext, reterr)
          }
        }()
        ...
16:     err = e.realizeBPFState(regenContext)
17:     if err != nil { <e.realizeBPFState returns an error>
          ...
        }
18:     e.finalizeProxyState(regenContext, reterr) <run deferred function>
19:       datapathRegenCtx.revertStack.Revert() <run pushed revertStack function from 10>
20:         e.desiredPolicy = e.realizedPolicy <from 10>

current state: e.desiredPolicy == e.realizedPolicy == <empty policy>

<reattempt regeneration>

22: e.regenerate(...)
23:   revision, err := e.regenerateBPF(...)
        e.runPreCompilationSteps(regenContext)
          e.regeneratePolicy(...)
            ...
            selectorPolicy, ... = e.policyRepo.GetSelectorPolicy(...) <no-op, since selectorPolicy == nil>
          e.setDesiredPolicy(...) <no-op - e.desiredPolicy still points to e.realizedPolicy, which is empty>
        ...

current state: e.desiredPolicy == e.realizedPolicy == <empty policy>

25:     err = e.realizeBPFState(regenContext) <e.realizeBPFState succeeds>
        ...
        ...
26:     return e.updateRealizedState(stats, origDir, revision)
          ...
          e.realizedPolicy = e.desiredPolicy <"locks in" the empty map>
          ...

current state: e.desiredPolicy == e.realizedPolicy == <empty policy>

<regeneration is done>

Explanation
===========

The second regeneration attempt does not recompute the endpoint policy, since we update e.nextPolicyRevision in the first attempt, and hence, e.desiredPolicy is not changed. The call succeeds, then the empty map state is "locked in" until a recomputation happens in the future.

The Fix
=======

To remedy this, restore the original value of e.desiredPolicy inside the revertStack function (10 in the sequence above) after it has a chance to restore the original policy map state:

    ...
    defer func(p *policy.EndpointPolicy) {
        e.desiredPolicy = p
    }(e.desiredPolicy)
    e.desiredPolicy = e.realizedPolicy
    ...

The rationale here is that we still want e.desiredPolicy to reflect our desired state of the world. We assign it temporarily to e.realizedPolicy so that e.syncPolicyMapWith works, but we don't want it to stay that way for the next regeneration attempt, since it may or may not recompute the desired endpoint policy.

Make Sure e.realizedPolicy Reflects Reality
===========================================

Note that the revertStack function above relies on the assumption that e.realizedPolicy reflects the current endpoint policy map state. However, if the agent has just started, e.realizedPolicy may be empty even for existing endpoints. Copy the state from the policyMapDump to e.realizedPolicy inside e.runPreCompilationSteps to make sure the restore logic actually works in these situations instead of clearing out the existing map.

Fixes: cilium#38998
Signed-off-by: Jordan Rife <jrife@google.com>
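To make the defer-restore pattern from the commit message easier to see in isolation, here is a minimal, self-contained Go sketch. The Endpoint, EndpointPolicy, RevertStack, and syncPolicyMap names below are simplified stand-ins for illustration only, not Cilium's actual definitions.

```go
package main

import "fmt"

// EndpointPolicy and Endpoint are simplified stand-ins for illustration only.
type EndpointPolicy struct {
	Entries map[string]string // pretend policy map contents
}

type Endpoint struct {
	desiredPolicy  *EndpointPolicy
	realizedPolicy *EndpointPolicy
}

// RevertStack mimics a stack of revert functions that run when a
// regeneration attempt fails partway through.
type RevertStack struct {
	funcs []func() error
}

func (s *RevertStack) Push(f func() error) { s.funcs = append(s.funcs, f) }

func (s *RevertStack) Revert() error {
	for i := len(s.funcs) - 1; i >= 0; i-- {
		if err := s.funcs[i](); err != nil {
			return err
		}
	}
	return nil
}

// syncPolicyMap stands in for syncing the datapath map toward whatever
// e.desiredPolicy currently points at.
func (e *Endpoint) syncPolicyMap() error {
	fmt.Printf("syncing datapath map to %d entries\n", len(e.desiredPolicy.Entries))
	return nil
}

func main() {
	e := &Endpoint{
		desiredPolicy:  &EndpointPolicy{Entries: map[string]string{"allow": "world"}},
		realizedPolicy: &EndpointPolicy{Entries: map[string]string{}}, // nothing realized yet
	}

	var revertStack RevertStack
	revertStack.Push(func() error {
		// Restore e.desiredPolicy once the rollback is done, so the next
		// regeneration attempt still knows what we actually wanted.
		defer func(p *EndpointPolicy) { e.desiredPolicy = p }(e.desiredPolicy)

		// Temporarily point desiredPolicy at realizedPolicy so the sync
		// below rolls the datapath back to the last known-good state.
		e.desiredPolicy = e.realizedPolicy
		return e.syncPolicyMap()
	})

	// Simulate a failed regeneration attempt triggering the revert.
	_ = revertStack.Revert()

	// desiredPolicy still reflects the computed policy, not the empty one.
	fmt.Printf("desired entries after revert: %d\n", len(e.desiredPolicy.Entries))
}
```

The key property is that the revert still rolls the datapath map back toward the last realized state, but e.desiredPolicy comes out of the revert unchanged, so a later retry is not tricked into treating the empty realized state as the desired one.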
/test
@jrife thanks for helping to dig into the root causes of those issues affecting the tree! Merging now.
Nope, mistake on my part. I did the rebase earlier against an older version of this branch -_-. I'll go ahead and create a new PR.
Transient errors inside regenerateBPF can lead to an endpoint's policy map either not being initialized or being accidentally cleared out, resulting in broken connectivity.
Scenario

In this scenario, a transient error causes the first call to regenerateBPF to fail for an endpoint. The second regeneration attempt succeeds, but the final policy map state is empty.
Explanation

The second regeneration attempt does not recompute the endpoint policy, since we update e.nextPolicyRevision in the first attempt, and hence, e.desiredPolicy is not changed. The call succeeds, then the empty map state is "locked in" until a recomputation happens in the future.
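For illustration, here is a minimal Go sketch of the short-circuit described above; the Endpoint type and regeneratePolicy helper are hypothetical simplifications rather than Cilium's real code.

```go
package main

import "fmt"

// Endpoint is a simplified stand-in, not Cilium's real type.
type Endpoint struct {
	nextPolicyRevision uint64
	desiredPolicy      string
}

// regeneratePolicy sketches the short-circuit: if the endpoint already claims
// to be at the repository's revision, the desired policy is left untouched.
func (e *Endpoint) regeneratePolicy(repoRevision uint64, computed string) {
	if e.nextPolicyRevision >= repoRevision {
		fmt.Println("revision up to date, skipping recomputation")
		return
	}
	e.desiredPolicy = computed
	e.nextPolicyRevision = repoRevision
}

func main() {
	e := &Endpoint{}

	// First attempt: policy is computed and the revision is bumped, but
	// imagine the datapath realization that follows this step fails.
	e.regeneratePolicy(7, "computed policy")

	// The failed attempt's revert leaves desiredPolicy pointing at the empty
	// realized state (simulated here), while nextPolicyRevision stays at 7.
	e.desiredPolicy = ""

	// Second attempt: the revision check short-circuits, so the empty
	// policy is what ends up being "locked in".
	e.regeneratePolicy(7, "computed policy")
	fmt.Printf("desired policy after retry: %q\n", e.desiredPolicy)
}
```

Because the revision was bumped during the failed first attempt, the retry's revision check passes and the now-empty desired policy is never recomputed.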
The Fix

To remedy this, revert nextPolicyRevision in the revert stack to ensure that e.desiredPolicy is recomputed on the next regeneration attempt.
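A rough sketch of this idea under the same simplifying assumptions (hypothetical Endpoint and RevertStack types): the revert pushed during the first attempt also rolls nextPolicyRevision back, so a retry sees a stale revision and recomputes the policy.

```go
package main

import "fmt"

// Endpoint and RevertStack are simplified stand-ins, not Cilium's real types.
type Endpoint struct {
	nextPolicyRevision uint64
}

type RevertStack struct{ funcs []func() error }

func (s *RevertStack) Push(f func() error) { s.funcs = append(s.funcs, f) }

func (s *RevertStack) Revert() {
	for i := len(s.funcs) - 1; i >= 0; i-- {
		_ = s.funcs[i]()
	}
}

func main() {
	e := &Endpoint{nextPolicyRevision: 3}
	var revert RevertStack

	// First regeneration attempt: remember the old revision before bumping
	// it, and push a revert that restores it if realization fails.
	oldRevision := e.nextPolicyRevision
	e.nextPolicyRevision = 7
	revert.Push(func() error {
		e.nextPolicyRevision = oldRevision
		return nil
	})

	// Realization fails, so the revert stack runs; the next attempt will
	// see a stale revision and recompute the desired policy.
	revert.Revert()
	fmt.Println("nextPolicyRevision after revert:", e.nextPolicyRevision)
}
```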
Make Sure e.realizedPolicy Reflects Reality

Note that the revertStack function above relies on the assumption that e.realizedPolicy reflects the current endpoint policy map state. However, if the agent has just started, e.realizedPolicy may be empty even for existing endpoints. Copy the state from the policyMapDump to e.realizedPolicy inside e.runPreCompilationSteps to make sure the restore logic actually works in these situations instead of clearing out the existing map.
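As a small illustration of the backfill idea, here is a sketch with hypothetical PolicyKey/PolicyEntry types and a backfillRealizedPolicy helper that stands in for the real pre-compilation logic; it only shows the shape of copying a dumped map into the realized view.

```go
package main

import "fmt"

// PolicyKey and PolicyEntry are simplified placeholders for the real map types.
type PolicyKey struct{ Identity uint32 }
type PolicyEntry struct{ Allow bool }

type Endpoint struct {
	realizedPolicy map[PolicyKey]PolicyEntry
}

// backfillRealizedPolicy copies the dumped datapath state into the endpoint's
// realized view, mirroring the idea of seeding e.realizedPolicy from a policy
// map dump during pre-compilation steps after an agent restart.
func (e *Endpoint) backfillRealizedPolicy(policyMapDump map[PolicyKey]PolicyEntry) {
	if e.realizedPolicy == nil {
		e.realizedPolicy = make(map[PolicyKey]PolicyEntry, len(policyMapDump))
	}
	for k, v := range policyMapDump {
		e.realizedPolicy[k] = v
	}
}

func main() {
	// After an agent restart the in-memory realized state is empty, but the
	// BPF map on the node still holds entries for the existing endpoint.
	dump := map[PolicyKey]PolicyEntry{
		{Identity: 1234}: {Allow: true},
		{Identity: 5678}: {Allow: true},
	}

	e := &Endpoint{}
	e.backfillRealizedPolicy(dump)
	fmt.Println("realized entries after backfill:", len(e.realizedPolicy))
}
```

With the realized view seeded from the dump, a subsequent revert has an accurate baseline to restore rather than an empty map.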
Testing

I tested this manually using the repro described in #38998. Before applying this patch, the repro resulted in broken connectivity to the coredns pods every time I started up my kind cluster. After applying this patch, they start up reliably and all liveness/readiness probes succeed.

Fixes: #38998