julianwiedmann
Member

Consider a pod-to-remote-hostport connection, in a cluster with native routing. Requests are DNATed by from-netdev when entering the remote node, and forwarded to the local backend. Replies are RevDNATed in to-netdev when flowing back to the client pod. But in the past this reply path didn't work when IPsec was enabled - the from-container program would select the reply packets for encryption and pass them through the stack for XFRM processing. By the time that to-netdev observes the packet, its content is encrypted and can no longer be RevDNATed. Therefore bpf_lxc contained special logic that would mark the connection's CT entry as '.node_port = 1' - which allowed the reply path in bpf_lxc to immediately tail-call into the RevDNAT code, prior to encryption.

The same applies to a pod-to-remote-nodeport connection (with SocketLB disabled) when the remote node selects a local backend.

But with the changes from #37723 we can now trust that RevDNAT happens in cil_to_host(), prior to encryption. Therefore we no longer need to mark such connections as "handle RevDNAT early".

Existing connections with the `.node_port = 1` flag are handled as before; the relevant code in the bpf_lxc reply path can be removed or constrained in a future release.
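
To make the ordering argument above concrete, here is a minimal toy model in plain C. It is not Cilium code: every `toy_*` name, the two-mode enum, and the single `encrypted` boolean are assumptions made for illustration only. It models just the one property described above, namely that RevDNAT can only succeed if it runs before the packet is encrypted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only. All names are hypothetical, none are Cilium identifiers. */

struct toy_ct_entry {
	bool node_port;         /* models the '.node_port = 1' CT flag */
	uint32_t frontend_ip;   /* address the client originally connected to */
	uint16_t frontend_port; /* port the client originally connected to */
};

struct toy_pkt {
	uint32_t src_ip;
	uint16_t src_port;
	bool encrypted;         /* set once the modeled XFRM step has run */
};

enum toy_datapath {
	TOY_LEGACY,  /* RevDNAT in to-netdev, i.e. after encryption */
	TOY_CURRENT, /* RevDNAT in cil_to_host(), i.e. before encryption */
};

/* Rewrite the reply's source back to the frontend. Fails on encrypted
 * packets, because the headers to rewrite are no longer readable. */
bool toy_rev_dnat(struct toy_pkt *pkt, const struct toy_ct_entry *ct)
{
	if (pkt->encrypted)
		return false;
	pkt->src_ip = ct->frontend_ip;
	pkt->src_port = ct->frontend_port;
	return true;
}

void toy_encrypt(struct toy_pkt *pkt)
{
	pkt->encrypted = true;
}

/* Walk one reply packet through the modeled stages. Returns true if the
 * packet ends up with its frontend address restored. */
bool toy_reply_path(enum toy_datapath dp, const struct toy_ct_entry *ct,
		    struct toy_pkt *pkt)
{
	bool translated = false;

	/* Pod egress: flagged connections are RevDNATed immediately. */
	if (ct->node_port)
		translated = toy_rev_dnat(pkt, ct);

	if (dp == TOY_LEGACY) {
		/* Encryption first, then the late RevDNAT attempt. */
		toy_encrypt(pkt);
		if (!translated)
			translated = toy_rev_dnat(pkt, ct);
	} else {
		/* RevDNAT first, then encryption. */
		if (!translated)
			translated = toy_rev_dnat(pkt, ct);
		toy_encrypt(pkt);
	}

	assert(pkt->encrypted);
	return translated;
}
```

In this model, an unflagged connection fails RevDNAT under `TOY_LEGACY` and succeeds under `TOY_CURRENT`, which is why the flag is no longer needed for new connections. A connection that already carries the flag is translated at the first step in either mode, matching the "handled as before" behavior for existing connections.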

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
@julianwiedmann added the following labels on Sep 3, 2025: area/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages), release-note/misc (This PR makes changes that have no direct user impact), area/loadbalancing (Impacts load-balancing and Kubernetes service implementations), feature/ipsec (Relates to Cilium's IPsec feature), area/kpr (Anything related to our kube-proxy replacement).
@julianwiedmann
Member Author

/test

@julianwiedmann julianwiedmann marked this pull request as ready for review September 3, 2025 06:59
@julianwiedmann julianwiedmann requested a review from a team as a code owner September 3, 2025 06:59
@julianwiedmann
Member Author

@jschwinger233 👋 maybe you want to do the honors on this one? It's taken long enough 😄

Member

@jschwinger233 jschwinger233 left a comment

My honor 😆

@julianwiedmann julianwiedmann added this pull request to the merge queue Sep 3, 2025
Merged via the queue into main with commit 6a7570d Sep 3, 2025
335 of 341 checks passed
@julianwiedmann julianwiedmann deleted the pr/jwi/main/bpf-lxc-ipsec-revdnat branch September 3, 2025 07:18
julianwiedmann added a commit that referenced this pull request Sep 4, 2025
#41487 and #41464 removed the last bits of IPsec-related code in bpf_lxc.
Strip down the compile & load testing accordingly.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Sep 5, 2025