
Conversation

anubhabMajumdar
Contributor

Description

Currently, if a policy is applied before Cilium is installed (e.g., through a policy directory), the agent fails to sync the policy for the endpoint. This change fixes the issue.

Root cause of the above bug - #37724 (comment)
Added a detailed comment explaining the fix in the commit.

Testing done

  1. Add the following CNP to the kind-worker node (one way to place it there is shown after the manifest):
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: static-coredns
spec:
  endpointSelector:
    matchLabels:
      k8s:io.kubernetes.pod.namespace: kube-system
      k8s:k8s-app: kube-dns
  ingress:
  - fromEntities:
    - cluster
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      - port: "9153"
        protocol: TCP
  egress:
  - toEntities:
    - kube-apiserver
    - host
  - toCIDR:
    - 1.1.1.1/32
    - 8.8.8.8/32
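One way to place the policy file on the kind-worker node, assuming the CNP above is saved as static-coredns.yaml (kind nodes are Docker containers, and the path matches the hostPath mount added in the next step):

# copy the policy onto the node before installing Cilium
docker exec kind-worker mkdir -p /tmp/policies
docker cp static-coredns.yaml kind-worker:/tmp/policies/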
  2. Install Cilium, with the following added to the agent DaemonSet:
diff --git a/install/kubernetes/cilium/templates/cilium-agent/daemonset.yaml b/install/kubernetes/cilium/templates/cilium-agent/daemonset.yaml
index 8588506c50..9aabf3982e 100644
--- a/install/kubernetes/cilium/templates/cilium-agent/daemonset.yaml
+++ b/install/kubernetes/cilium/templates/cilium-agent/daemonset.yaml
@@ -126,6 +126,7 @@ spec:
         command:
         - cilium-agent
         args:
+        - --static-cnp-path=/policies
         - --config-dir=/tmp/cilium/config-map
         {{- with .Values.extraArgs }}
         {{- toYaml . | trim | nindent 8 }}
@@ -299,6 +300,9 @@ spec:
           {{- end }}
         terminationMessagePolicy: FallbackToLogsOnError
         volumeMounts:
+        - name: policies
+          mountPath: /policies
+          readOnly: false
         {{- if .Values.authentication.mutual.spire.enabled }}
         - name: spire-agent-socket
           mountPath: {{ dir .Values.authentication.mutual.spire.adminSocketPath }}
@@ -827,6 +831,10 @@ spec:
       {{- end }}
       {{- end }}
       volumes:
+      - name: policies
+        hostPath:
+          path: /tmp/policies
+          type: DirectoryOrCreate
  3. Check the policy map:
$ k exec -it $(kubectl get pods -n kube-system -l app.kubernetes.io/name=cilium-agent -o wide | awk '$7=="kind-worker"{print $1}') -- cilium-dbg endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                  IPv6                  IPv4           STATUS
           ENFORCEMENT        ENFORCEMENT
159        Disabled           Disabled          1          reserved:host                                                                                                     ready
1711       Disabled           Disabled          4          reserved:health                                                              fd00:10:244:1::2486   10.244.1.216   ready
2580       Enabled            Enabled           59673      k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system   fd00:10:244:1::3a68   10.244.1.75    ready
                                                           k8s:io.cilium.k8s.policy.cluster=kind-kind
                                                           k8s:io.cilium.k8s.policy.serviceaccount=coredns
                                                           k8s:io.kubernetes.pod.namespace=kube-system
                                                           k8s:k8s-app=kube-dns
$
$
$  k exec -it $(kubectl get pods -n kube-system -l app.kubernetes.io/name=cilium-agent -o wide | awk '$7=="kind-worker"{print $1}') -- cilium-dbg bpf policy get 2580
POLICY   DIRECTION   LABELS (source:key[=value])   PORT/PROTO   PROXY PORT   AUTH TYPE   BYTES    PACKETS   PREFIX
Allow    Ingress     ANY                           53/TCP       NONE         disabled    0        0         24
Allow    Ingress     ANY                           9153/TCP     NONE         disabled    0        0         24
Allow    Ingress     ANY                           53/UDP       NONE         disabled    0        0         24
Allow    Ingress     ANY                           53/SCTP      NONE         disabled    0        0         24
Allow    Ingress     reserved:host                 ANY          NONE         disabled    0        0         0
Allow    Egress      reserved:host                 ANY          NONE         disabled    0        0         0
Allow    Egress      reserved:kube-apiserver       ANY          NONE         disabled    461553   2015      0
                     reserved:remote-node
Allow    Egress      cidr:8.8.8.8/32               ANY          NONE         disabled    0        0         0
                     reserved:world-ipv4
Allow    Egress      cidr:1.1.1.1/32               ANY          NONE         disabled    39599    601       0
                     reserved:world-ipv4

Fixes: #37724

@anubhabMajumdar anubhabMajumdar requested a review from a team as a code owner July 3, 2025 23:02
@anubhabMajumdar anubhabMajumdar requested a review from tklauser July 3, 2025 23:02
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 3, 2025
@anubhabMajumdar anubhabMajumdar force-pushed the topic/anmajumdar/policymapsync branch from d411d63 to 2571e83 on July 7, 2025 15:38
@anubhabMajumdar anubhabMajumdar requested review from a team as code owners July 7, 2025 15:38
@anubhabMajumdar anubhabMajumdar requested a review from squeed July 7, 2025 15:39
@anubhabMajumdar anubhabMajumdar force-pushed the topic/anmajumdar/policymapsync branch 3 times, most recently from 851885d to cb40019 on July 7, 2025 15:56
@anubhabMajumdar
Contributor Author

/test

@santhoshmprabhu

@anubhabMajumdar, could we check if we could add a UT that fails without your fix, and passes with?

@anubhabMajumdar
Contributor Author

@anubhabMajumdar, could we check if we could add a UT that fails without your fix, and passes with?

I think the best way to make sure this doesn't regress in future would be to add an E2E test that validates that policies are synced to eBPF maps under all possible scenarios. I am going to file an issue to follow up once I check this in.

Member

@jrajahalme jrajahalme left a comment

The fact that policyMapSyncDone = true was set outside of the if block looks like a clear bug in retrospect, and moving it to the end of the if block is the correct fix.

The rest of it is a matter of taste; as the comment says, setting policyMapSyncDone = false is not necessary, as it defaults to false when datapathRegenCtxt is created.
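
A minimal sketch of the shape of the fix described above; datapathRegenCtxt and policyMapSyncDone are the names discussed in this thread, while syncPolicyMap, the guard condition, and the surrounding functions are placeholders rather than the actual regeneration code:

package main

import (
	"errors"
	"fmt"
)

// datapathRegenCtxt stands in for the per-regeneration context; the zero
// value of policyMapSyncDone is false, which is why the fix does not need
// to reset the flag explicitly when the context is created.
type datapathRegenCtxt struct {
	policyMapSyncDone bool
}

// syncPolicyMap is a stand-in for the real endpoint policy-map sync.
func syncPolicyMap() error { return errors.New("sync failed") }

// Buggy shape: the flag is set outside the if block, so the sync is
// recorded as done even when it never ran or returned an error, and no
// later pass will retry it.
func regenerateBuggy(ctx *datapathRegenCtxt, syncNeeded bool) {
	if syncNeeded {
		_ = syncPolicyMap() // a failure here still falls through to the flag
	}
	ctx.policyMapSyncDone = true
}

// Fixed shape: the assignment moves to the end of the if block, so the
// flag is only set after the sync returns without error.
func regenerateFixed(ctx *datapathRegenCtxt, syncNeeded bool) error {
	if syncNeeded {
		if err := syncPolicyMap(); err != nil {
			return err
		}
		ctx.policyMapSyncDone = true
	}
	return nil
}

func main() {
	buggy := &datapathRegenCtxt{}
	regenerateBuggy(buggy, true)
	fmt.Println("buggy marks sync done despite failure:", buggy.policyMapSyncDone) // true

	fixed := &datapathRegenCtxt{}
	err := regenerateFixed(fixed, true)
	fmt.Println("fixed leaves the flag unset on failure:", fixed.policyMapSyncDone, err) // false
}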

@jrajahalme jrajahalme added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jul 8, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jul 8, 2025
@jrajahalme jrajahalme removed the needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch label Jul 8, 2025
@jrajahalme
Member

Marked as needing backport to 1.18. Since this bug manifests only if there are policies read from the file system, further backport needs depend on which Cilium version introduced support for importing policies from a directory.

@anubhabMajumdar
Contributor Author

The fact that policyMapSyncDone = true was set outside of the if block looks like a clear bug in retrospect, and moving it to the end of the if block is the correct fix.

Yes, this absolutely is a bug, because we are setting policyMapSyncDone without actually syncing. In the other places in the code, the flag is set only after the sync function returns without error.

@gandro gandro removed request for a team July 9, 2025 08:21
@anubhabMajumdar
Contributor Author

anubhabMajumdar commented Jul 9, 2025

@jrajahalme Thanks for your review. I could reproduce the issue from 1.17 onwards, so this will need a backport to 1.17 as well.

@jrajahalme jrajahalme added the needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch label Jul 10, 2025
Member

@fristonio fristonio left a comment

Nice catch <3
This additionally fixes another issue in Cilium startup behavior, where the initial policy is not synced correctly with the BPF policy map. For example, if a policy is deleted while the Cilium agent is down, we don't clean up the stale entries on startup.
Let's capture this scenario as well in the follow-up issue for tests.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: netshoot
  labels:
    app: netshoot
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["tail", "-f", "/dev/null"]
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: cluster-only-egress
spec:
  endpointSelector:
    matchLabels:
      app: netshoot
  egress:
  - toEntities:
    - cluster
EOF

sleep 30
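# get_cilium_pod is assumed to be a helper that resolves the Cilium agent
# pod name on the given node, e.g. the kubectl get pods/awk one-liner used
# in the testing steps above.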
kubectl -n kube-system exec -it $(get_cilium_pod kind-worker) -- cilium-dbg bpf policy get --all

kubectl -n kube-system rollout restart daemonset cilium && kubectl delete cnp cluster-only-egress

sleep 30
kubectl -n kube-system exec -it $(get_cilium_pod kind-worker) -- cilium-dbg bpf policy get --all
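
With the fix, the second dump should no longer contain entries from the deleted cluster-only-egress policy; without it, the stale entries survive the agent restart.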

@tklauser tklauser enabled auto-merge July 10, 2025 15:01
@tklauser tklauser added the affects/v1.15 This issue affects v1.15 branch label Jul 10, 2025
@anubhabMajumdar
Contributor Author

/ci-ipsec-e2e

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 29, 2025
@joestringer
Member

joestringer commented Jul 29, 2025

Do we still have the same high failure rate with this PR as was observed before? It's a bit hard to tell based on the most recent failures reported above. Basically I'm wondering whether other changes in the tree were combining with this to exacerbate the failure rate, or if this PR by itself still triggers an increase in failure rate of some of the TLS tests. CI got a bit more stable after we merged #40691, so if the failures were related to that PR as much as this PR then I think that would be really valuable information to know.

@joestringer joestringer added dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Jul 29, 2025
@joestringer
Member

Recording the latest successful ci-e2e-upgrade run: https://github.com/cilium/cilium/actions/runs/16575187703

@joestringer
Member

/ci-e2e-upgrade

@anubhabMajumdar
Contributor Author

anubhabMajumdar commented Jul 29, 2025

Do we still have the same high failure rate with this PR as was observed before? It's a bit hard to tell based on the most recent failures reported above. Basically I'm wondering whether other changes in the tree were combining with this to exacerbate the failure rate, or if this PR by itself still triggers an increase in failure rate of some of the TLS tests. CI got a bit more stable after we merged #40691, so if the failures were related to that PR as much as this PR then I think that would be really valuable information to know.

Failed run - https://github.com/cilium/cilium/actions/runs/16606307675/job/46978909590

❌ 1/14 tests failed (1/295 actions), 6 tests skipped, 0 scenarios skipped:
Test [client-with-service-account-egress-to-echo]:
  🟥 client-with-service-account-egress-to-echo/pod-to-pod:curl-ipv6-0: cilium-test-1/client-645b68dcf7-6srn5 (fd00:10:244:1::1a25) -> cilium-test-1/echo-other-node-7f546db4f4-nzvtp (fd00:10:244:2::cad3:8080): command "curl --silent --fail --show-error --connect-timeout 2 --max-time 10 -6 -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code}\n --output /dev/null http://[fd00:10:244:2::cad3]:8080" failed: command failed (pod=cilium-test-1/client-645b68dcf7-6srn5, container=client): command terminated with exit code 28
    ⛑️ The following owners are responsible for reliability of the testsuite: 
        - @cilium/sig-agent (pod-to-pod)
        - @cilium/ci-structure (.github/workflows/tests-e2e-upgrade.yaml)

I haven't seen this failure before while testing this PR. The previously problematic test, client-egress-sni*, passed in all scenarios.

Note: Will run once more to verify the SNI test again.

@anubhabMajumdar
Contributor Author

/ci-e2e-upgrade

@anubhabMajumdar
Contributor Author

/ci-e2e-upgrade

Passing run - https://github.com/cilium/cilium/actions/runs/16608366032

@anubhabMajumdar
Contributor Author

Interesting: after rebasing on top of #40696 and the pipeline fix #40691, I am no longer seeing the issue with the client-egress-sni* tests. I have three back-to-back passing runs.

@joestringer
Member

From what I recall, we were getting at least one job failing each time we ran the CI suite last time, right? Three in a row seems like a pretty high success rate.

I think it'd be worth filing an issue for the newly found failure from the second run, then maybe we can just merge this PR as-is.

@joestringer joestringer removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Jul 29, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 29, 2025
@joestringer joestringer added affects/v1.17 This issue affects v1.17 branch and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jul 29, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 29, 2025
@joestringer
Member

I updated the backport labels based on the backport criteria and the fact that this change has surprisingly wide impact; it's probably better to get the fix out into the hands of users via the latest release first, as a way to minimize impact. I'd be open to revisiting down the road whether the severity is high enough on this one to backport further.

@joestringer joestringer added this pull request to the merge queue Jul 29, 2025
@joestringer
Member

joestringer commented Jul 29, 2025

@anubhabMajumdar I appreciate your patience on this one; the contribution process here wasn't as smooth as I'd like it to be. Furthermore, I think the testing in this area of the code is so broad that it's tough to get confidence about a change like this. Despite being just a couple of lines, as we saw, it seemed to have some butterfly effects elsewhere that raised some doubts about whether some other logic is misaligned. I'd be open to thoughts on what we might do better in future for cases like this, particularly if you think there are specific things we could do around improving testing for this codepath, or the processes around testing.

Merged via the queue into cilium:main with commit 2f90a8f Jul 29, 2025
68 checks passed
@anubhabMajumdar anubhabMajumdar deleted the topic/anmajumdar/policymapsync branch July 29, 2025 23:22
@rastislavs rastislavs mentioned this pull request Jul 31, 2025
14 tasks
@rastislavs rastislavs added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jul 31, 2025
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Aug 4, 2025
Labels

  • affects/v1.17 This issue affects v1.17 branch
  • backport-done/1.18 The backport for Cilium 1.18.x for this PR is done.
  • ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
  • release-note/bug This PR fixes an issue in a previous release of Cilium.
Development

Successfully merging this pull request may close these issues:

  • static policy watcher ignores files with multiple YAML objects