Skip to content

Conversation

julianwiedmann
Copy link
Member

@julianwiedmann julianwiedmann commented Jan 10, 2025

The code to handle EGW reply traffic lives in the nodeport ingress path, after a packet has been RevSNATed. In contrast to the outbound path, it doesn't consider whether the packet's source is an in-cluster entity.

This becomes problematic when the EGW policy specifies a broad-range destinationCIDR, such as 0.0.0.0/0, and accidentally matches in-cluster service traffic. More precisely:

  • a client pod matches an EGW policy of the form "from $client_pod to 0.0.0.0/0, exit via gateway nodeB"
  • client_pod accesses a nodeport service on gateway nodeB, where the request is DNATed to a remote backend, and SNATed to nodeB's IP.
  • once the backend's reply reaches nodeB, it is revSNATed to client_pod. At this point the packet matches the EGW policy entry, and is thus redirected into the overlay network and forwarded to client_pod.
  • As the RevDNAT step has been skipped, client_pod rejects this packet from an unknown source.

Fix this by applying the RevDNAT logic prior to looking for EGW policy matches. Double-checking whether the packet's origin is an in-cluster entity might still make sense, but no longer strictly necessary.

From a tcpdump on a kind cluster (with EGW, KPR without SocketLB and native routing):

# client accesses nodeport 31936 on remote LB node
IP 10.244.1.236.35556 > 172.18.0.2.31936: Flags [S]
# request is DNATed + SNATed, and forwarded to remote backend
IP 172.18.0.2.35556 > 10.244.3.121.80: Flags [S]
# backend's reply reaches the LB node
IP 10.244.3.121.80 > 172.18.0.2.35556: Flags [S.]
# LB node applies revSNAT, adds VXLAN encapsulation and forwards to client
IP 172.18.0.2.54624 > 172.18.0.4.8472: OTV, flags [I] (0x08), overlay 0, instance 6
IP 10.244.3.121.80 > 10.244.1.236.35556: Flags [S.]
# client's RST to the backend
IP 10.244.1.236.35556 > 10.244.3.121.80: Flags [R]
Fix a bug that prevents a pod from accessing Nodeport services when the pod is also in scope of a broad-range Egress Gateway policy.

@julianwiedmann julianwiedmann added kind/bug This is a bug in the Cilium logic. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. feature/egress-gateway Impacts the egress IP gateway feature. area/loadbalancing Impacts load-balancing and Kubernetes service implementations needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 10, 2025
@julianwiedmann julianwiedmann added affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch labels Jan 10, 2025
@julianwiedmann
Copy link
Member Author

/test

@julianwiedmann julianwiedmann added the dont-merge/preview-only Only for preview or testing, don't merge it. label Jan 10, 2025
@julianwiedmann julianwiedmann requested review from a team and pippolo84 and removed request for a team January 10, 2025 11:25
@julianwiedmann julianwiedmann marked this pull request as ready for review January 10, 2025 11:25
@julianwiedmann julianwiedmann requested a review from a team as a code owner January 10, 2025 11:25
@julianwiedmann
Copy link
Member Author

(I'll also want to cook a BPF-level test for this problem)

The code to handle EGW reply traffic lives in the nodeport ingress path,
after a packet has been RevSNATed. In contrast to the outbound path, it
doesn't consider whether the packet's source is an in-cluster entity.

This becomes problematic when the EGW policy specifies a broad-range
`destinationCIDR`, such as 0.0.0.0/0, and accidentally matches in-cluster
service traffic. More precisely:
* a client pod matches an EGW policy of the form
  "from $client_pod to 0.0.0.0/0, exit via gateway nodeB"
* client_pod accesses a nodeport service on gateway nodeB, where the
  request is DNATed to a remote backend, and SNATed to nodeB's IP.
* once the backend's reply reaches nodeB, it is revSNATed to client_pod.
  At this point the packet matches the EGW policy entry, and is thus
  redirected into the overlay network and forwarded to client_pod.
* As the RevDNAT step has been skipped, client_pod rejects this packet from
  an unknown source.

Fix this by applying the RevDNAT logic *prior* to looking for EGW policy
matches. Double-checking whether the packet's origin is an in-cluster
entity might still make sense, but no longer strictly necessary.


From a tcpdump on a kind cluster (with EGW, KPR without SocketLB and
native routing):

```
 # client accesses nodeport 31936 on remote LB node
IP 10.244.1.236.35556 > 172.18.0.2.31936: Flags [S]
 # request is DNATed + SNATed, and forwarded to remote backend
IP 172.18.0.2.35556 > 10.244.3.121.80: Flags [S]
 # backend's reply reaches the LB node
IP 10.244.3.121.80 > 172.18.0.2.35556: Flags [S.]
 # LB node applies revSNAT, adds VXLAN encapsulation and forwards to client
IP 172.18.0.2.54624 > 172.18.0.4.8472: OTV, flags [I] (0x08), overlay 0, instance 6
IP 10.244.3.121.80 > 10.244.1.236.35556: Flags [S.]
 # client's RST to the backend
IP 10.244.1.236.35556 > 10.244.3.121.80: Flags [R]
```

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Reply traffic for an SNATed in-cluster nodeport connection should not be
captured by a catch-all EGW policy. Otherwise the packet reaches the client
without the source being RevDNATed back to the nodeport, and thus
resulting in a RST by the client.

This adds testing for the scenario that was fixed in the preceding commit.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
@julianwiedmann julianwiedmann removed the dont-merge/preview-only Only for preview or testing, don't merge it. label Jan 13, 2025
@julianwiedmann
Copy link
Member Author

/test

Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@pippolo84 pippolo84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 14, 2025
@julianwiedmann julianwiedmann added this pull request to the merge queue Jan 14, 2025
Merged via the queue into cilium:main with commit a899c4a Jan 14, 2025
64 checks passed
@julianwiedmann julianwiedmann deleted the 1.18-egw-reply-fix branch January 14, 2025 08:30
@rastislavs rastislavs mentioned this pull request Jan 21, 2025
45 tasks
@rastislavs rastislavs added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 21, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Jan 22, 2025
@rastislavs rastislavs mentioned this pull request Jan 22, 2025
19 tasks
@rastislavs rastislavs added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Jan 22, 2025
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/loadbalancing Impacts load-balancing and Kubernetes service implementations backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. feature/egress-gateway Impacts the egress IP gateway feature. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants