policy: Fix transient policy deny during agent restart #17115
Conversation
Thanks for the fixes!
How did you find and validate this? Was it enough to restart Cilium with chatty pods and policies in place? If so, why did we miss those issues in CI?
test-me-please
Hi @pchaigno, my test setup to reproduce is the following: enforce an ingress policy (that allows access from clientpod) on serverpod on
I'm guessing the CI 3.0 workflows didn't trigger because of yesterday's GitHub outage:
🚀
The only failing test is on EKS with IPsec and L7 policies; it therefore corresponds to #17139, a known broken test. Review requests are covered. I'm marking ready to merge.
Nice find, thanks for the submission!
I've got some concerns about the first patch. I have proposed an alternative solution that I think should achieve the goal of fixing the transient policy deny during agent restart, but without deferring the policy synchronization until so late in the restore.
@@ -298,10 +298,13 @@ func (e *Endpoint) restoreIdentity() error {
			return ErrNotAlive
		case <-gotInitialGlobalIdentities:
		}
	}
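For context on that hunk, here is a minimal, self-contained sketch of the wait it touches, assuming hypothetical names for everything except gotInitialGlobalIdentities and ErrNotAlive (illustrative only, not the actual Cilium code): identity restore blocks until the initial global identities have arrived, or aborts if the endpoint is torn down first.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotAlive mirrors the error returned above when the endpoint is removed
// while restore is still waiting.
var ErrNotAlive = errors.New("endpoint is not alive")

// waitForInitialIdentities blocks until the initial global identities are
// known, so restored policy is not computed against an empty identity cache,
// or bails out if the endpoint goes away first.
func waitForInitialIdentities(endpointStopped, gotInitialGlobalIdentities <-chan struct{}) error {
	select {
	case <-endpointStopped:
		return ErrNotAlive
	case <-gotInitialGlobalIdentities:
		return nil
	}
}

func main() {
	stopped := make(chan struct{})
	gotIdentities := make(chan struct{})
	close(gotIdentities) // pretend the initial identities already arrived
	fmt.Println(waitForInitialIdentities(stopped, gotIdentities))
}
```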
Second patch LGTM.
I'm curious, do you have an easy way to reproduce these connectivity issues during agent restart? We have some tests like the Upgrade tests that attempt to do this, but it sounds like your testing is able to more reliably pick up on issues like this. If we can integrate such tests into the Cilium CI, we can hopefully catch these bugs in future before they get merged into the tree.
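For reference, a throwaway probe along these lines is enough to surface the deny window. This is a minimal sketch under assumptions (the serverpod address is a placeholder, and none of this is part of the PR): dial the server pod continuously while restarting the agent and count the failures.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	const target = "10.0.0.42:80" // placeholder serverpod address

	failures := 0
	for i := 0; i < 600; i++ { // ~60s window; restart the agent meanwhile
		conn, err := net.DialTimeout("tcp", target, 200*time.Millisecond)
		if err != nil {
			failures++ // a transient policy deny shows up as a failed dial
		} else {
			conn.Close()
		}
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("failed probes during the restart window:", failures)
}
```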
I see there was already discussion on this above in the PR; I missed that initially :)
We do not need to resolve this testing question to merge the fix, but it would be a nice follow-up to prevent future regressions.
Force-pushed from 9a038a7 to 7d15312
Looks good, thanks.
test-me-please
#15878 (comment) @yorg1st @chaosbox should be able to do it next week.
I did some (very brief) triage of the failing tests. If we click through the "Details" links and scan through, we find:
- Kind:
- AKS:
- EKS:
However, I would not expect this PR to have an effect on those tests. This PR adjusts the "endpoint restore" logic, which runs when Cilium restarts with active workloads. As far as I know, the tests never restart Cilium: they only start pods after Cilium has started and stop them before it shuts down. So I don't think these failures are related to the code being changed.

I have opened a thread in the Slack #testing channel to see whether anyone knows anything about such failures. I haven't yet searched the GitHub issues for similar reports; that would be another good next step: try to correlate with other reports and see whether a regression was recently introduced into the tree.
Currently, during endpoint restoration, policy maps are flushed before they are refilled, which introduces a transient policy deny. Instead of flushing all policy map entries while restoring/initializing the endpoint policyMap, this patch synchronizes the in-memory realized state with the BPF map entries; any remaining discrepancy between desired and realized state is then handled by the following e.syncPolicyMap.

Fixes: cilium#15878
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
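To illustrate the approach described in this commit message, here is a minimal sketch with assumed types and helper names (not the actual Cilium code): instead of flushing the per-endpoint BPF policy map, its current entries are adopted as the realized state, so the subsequent desired-vs-realized sync (e.syncPolicyMap) only applies the delta and existing allows keep working during the restart.

```go
package main

import "fmt"

// Hypothetical stand-ins for the per-endpoint policy map key/entry types.
type policyKey struct {
	Identity uint32
	DestPort uint16
	Proto    uint8
}
type policyEntry struct{ ProxyPort uint16 }

// restoreRealizedState adopts the entries already present in the datapath as
// the in-memory realized state, rather than starting from an empty state
// after a flush (which is what opened the transient deny window).
func restoreRealizedState(bpfEntries map[policyKey]policyEntry) map[policyKey]policyEntry {
	realized := make(map[policyKey]policyEntry, len(bpfEntries))
	for k, v := range bpfEntries {
		realized[k] = v
	}
	return realized
}

func main() {
	existing := map[policyKey]policyEntry{
		{Identity: 1234, DestPort: 80, Proto: 6}: {},
	}
	realized := restoreRealizedState(existing)
	fmt.Println("entries adopted as realized state:", len(realized))
}
```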
Currently, during endpoint restoration, the ipcache map is unpinned and recreated by Map.OpenParallel. Regenerated endpoints look up entries in the newly created ipcache map, so endpoint regeneration should wait for the ipcache map to be synchronized. However, the regeneration of the host endpoint does not wait for the ipcache map sync, which introduces a transient policy deny. This patch fixes that by making all endpoints wait for the ipcache map sync.

Fixes: cilium#15878
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
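And a sketch of the second commit's idea, again with assumed names rather than Cilium's actual code: every endpoint regeneration, including the host endpoint's, waits on a channel that is closed only once the recreated ipcache map has been repopulated.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// regenerateEndpoint blocks until the ipcache map has been repopulated (or
// the context is cancelled) before doing any lookups against it.
func regenerateEndpoint(ctx context.Context, name string, ipcacheSynced <-chan struct{}) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-ipcacheSynced:
	}
	fmt.Println("regenerating", name, "against a populated ipcache")
	return nil
}

func main() {
	ipcacheSynced := make(chan struct{})
	ctx := context.Background()

	var wg sync.WaitGroup
	// The host endpoint is treated the same as workload endpoints here.
	for _, ep := range []string{"host-endpoint", "pod-endpoint-1", "pod-endpoint-2"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			_ = regenerateEndpoint(ctx, name, ipcacheSynced)
		}(ep)
	}

	time.Sleep(100 * time.Millisecond) // stand-in for the ipcache restore
	close(ipcacheSynced)
	wg.Wait()
}
```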
Force-pushed from 7d15312 to 5c67642
Thanks for the investigation! Rebased.
test-me-please

Job 'Cilium-PR-K8s-1.16-net-next' failed and has not been observed before, so may be related to your PR:

Test Name
Failure Output

If it is a flake, comment
I guess this is #16399?
Yes. So the only failing test is a known flake and review requests are covered. Marking ready to merge.
Nice 🌞, will it be backported to 1.9 or only 1.10?
We'll backport to both.
The 1.9 backport for this commit isn't straightforward, see this thread: #17328 (comment).
That's what @yorg1st told me when he was trying to deploy the fix on their clusters. Up to you, but it's probably safer to backport only to 1.10.
Please see commit msg
Fixes: #15878