fix(GH-37724): Sync policies on startup #40357
Conversation
/test
@anubhabMajumdar, could we check whether we can add a UT that fails without your fix and passes with it?
I think the best way to make sure this doesn't regress in the future would be to add an E2E test that validates that policies are synced to eBPF maps under all possible scenarios. I am going to file an issue to follow up once I check this in.
The fact that `policyMapSyncDone = true` was set outside of the `if` block looks like a clear bug in retrospect, and moving it to the end of the `if` block is the correct fix.
The rest of it is a matter of taste; as the comment says, setting `policyMapSyncDone = false` is not necessary, as it defaults to `false` when `datapathRegenCtxt` is created.
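To make the placement issue concrete, here is a minimal Go sketch of the pattern described above; the struct and function are simplified stand-ins rather than the actual Cilium endpoint-regeneration code, and only the field and type names are taken from the comment.

package main

import "fmt"

// datapathRegenCtxt stands in for the per-regeneration context mentioned above.
// Its zero value leaves policyMapSyncDone as false, which is why no explicit
// reset to false is needed when the context is created.
type datapathRegenCtxt struct {
	policyMapSyncDone bool
}

// syncPolicyMapIfNeeded sketches the corrected control flow: the flag is set
// only at the end of the if block, i.e. only when a sync actually happened.
func syncPolicyMapIfNeeded(ctx *datapathRegenCtxt, needsSync bool) {
	if needsSync && !ctx.policyMapSyncDone {
		// ... write the endpoint's policy entries to the eBPF policy map ...
		ctx.policyMapSyncDone = true // correct: recorded only after a real sync
	}
	// The bug was the equivalent of setting ctx.policyMapSyncDone = true here,
	// outside the if block, so a skipped sync was still marked as done.
}

func main() {
	ctx := &datapathRegenCtxt{}
	syncPolicyMapIfNeeded(ctx, false)
	fmt.Println("after skipped sync:", ctx.policyMapSyncDone) // false
	syncPolicyMapIfNeeded(ctx, true)
	fmt.Println("after real sync:", ctx.policyMapSyncDone) // true
}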
Marked as needing backport to 1.18. Since this bug manifests only if there are policies read from the file system, further backport needs depend on which Cilium version introduced support for importing policies from a directory.
Yes, this absolutely is a bug because we are setting `policyMapSyncDone = true` outside the `if` block.
@jrajahalme Thanks for your review. I could reproduce the issue from 1.17 onwards, so this would need backport to 1.17 as well.
Nice Catch <3
This additionally fixes another issue in Cilium startup behavior where the initial policy is not synced correctly with the BPF policy map. For example, if a policy is deleted while the Cilium agent is down, we don't clean up the stale entries on startup.
Let's capture this scenario as well in the follow-up issue for tests.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: netshoot
  labels:
    app: netshoot
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["tail", "-f", "/dev/null"]
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: cluster-only-egress
spec:
  endpointSelector:
    matchLabels:
      app: netshoot
  egress:
  - toEntities:
    - cluster
EOF
sleep 30
kubectl -n kube-system exec -it $(get_cilium_pod kind-worker) -- cilium-dbg bpf policy get --all
kubectl -n kube-system rollout restart daemonset cilium && kubectl delete cnp cluster-only-egress
sleep 30
kubectl -n kube-system exec -it $(get_cilium_pod kind-worker) -- cilium-dbg bpf policy get --all
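# Expected outcome (an assumption based on the description above, not output from
# the original comment): with the fix, the second policy dump should no longer
# contain entries for the deleted cluster-only-egress policy; without the fix,
# the stale entries persist across the agent restart.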
/ci-ipsec-e2e
Do we still have the same high failure rate with this PR as was observed before? It's a bit hard to tell based on the most recent failures reported above. Basically I'm wondering whether other changes in the tree were combining with this to exacerbate the failure rate, or if this PR by itself still triggers an increase in the failure rate of some of the TLS tests. CI got a bit more stable after we merged #40691, so if the failures were related to that PR as much as this PR, then I think that would be really valuable information to know.
Recording the latest successful ci-e2e-upgrade run: https://github.com/cilium/cilium/actions/runs/16575187703
/ci-e2e-upgrade
Failed run - https://github.com/cilium/cilium/actions/runs/16606307675/job/46978909590
I haven't seen this failure before while testing this PR. The problematic test was the SNI test. Note: I will run once more to verify the SNI test again.
/ci-e2e-upgrade
Passing run - https://github.com/cilium/cilium/actions/runs/16608366032
From what I recall, we were getting at least one job failing each time we ran the CI suite last time, right? Three in a row seems like a pretty high success rate. I think it'd be worth filing an issue for the newly found failure from the second run, then maybe we can just merge this PR as-is.
I updated the backport labels based on the backport criteria and the fact that this change has a surprisingly wide impact - it's probably better to get the fix into the hands of users via the latest release first as a way to minimize impact. I'd be open to revisiting down the road whether the severity is high enough on this one to backport further.
@anubhabMajumdar I appreciate your patience on this one; the contribution process here wasn't as smooth as I'd like it to be. Furthermore, I think the testing in this area of the code is so broad that it's tough to get confidence about a change like this. Despite just being a couple of lines, as we saw it seemed to have some butterfly effects elsewhere that raised doubts about whether some other logic is misaligned. I'd be open to thoughts on what we might do better in the future for cases like this, particularly if you think there are specific things we could do around improving testing for this codepath, or the processes around testing.
Description
Currently, if a policy is applied before Cilium is installed (e.g., through a directory), the agent fails to sync the policy for the endpoint. This change fixes the issue.
Root cause of the above bug - #37724 (comment)
Added a detailed comment explaining the fix in the commit.
Testing done
kind-worker node
Fixes: #37724