bpf: lxc: don't special-case the RevDNAT path for IPsec configs #41487
Merged
Consider a pod-to-remote-hostport connection in a cluster with native routing. Requests are DNATed by from-netdev when entering the remote node, and forwarded to the local backend. Replies are RevDNATed in to-netdev when flowing back to the client pod.

In the past this reply path didn't work when IPsec was enabled: the from-container program would select the reply packets for encryption and pass them through the stack for XFRM processing. By the time to-netdev observes the packet, its content is encrypted and can no longer be RevDNATed. Therefore bpf_lxc contained special logic that would mark the connection's CT entry as `.node_port = 1`, which allowed the reply path in bpf_lxc to immediately tail-call into the RevDNAT code, prior to encryption.

The same applies to a pod-to-remote-nodeport connection (with SocketLB disabled) when the remote node selects a local backend.

With the changes from #37723 we can now trust that RevDNAT happens in cil_to_host(), prior to encryption. Therefore we no longer need to mark such connections as "handle RevDNAT early".

Existing connections with the `.node_port = 1` flag are handled as before; the relevant code in the bpf_lxc reply path can be removed / constrained in a future release.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
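The change described above can be pictured with a small stand-alone C model. This is only a sketch: it is not the code in bpf_lxc, and every struct, enum and function name in it is hypothetical, chosen for illustration. The one detail taken from the description is the `.node_port` CT flag and its meaning of "handle RevDNAT early".

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a conntrack entry. Only the
 * node_port flag from the description is modelled. */
struct ct_entry_sketch {
	bool node_port; /* "handle RevDNAT early" in the bpf_lxc reply path */
};

/* Hypothetical description of a new connection reaching a local backend. */
struct conn_sketch {
	bool ipsec_enabled;
	bool dnat_on_backend_node; /* remote hostport/nodeport, local backend */
};

/* Assumed model of the old behaviour: IPsec configs were special-cased
 * so that replies were RevDNATed before encryption. */
static bool mark_node_port_old(const struct conn_sketch *c)
{
	return c->ipsec_enabled && c->dnat_on_backend_node;
}

/* Assumed model of the new behaviour: no IPsec special case, because
 * RevDNAT is trusted to happen later but still prior to encryption. */
static bool mark_node_port_new(const struct conn_sketch *c)
{
	(void)c;
	return false;
}

enum revdnat_hook_sketch {
	REVDNAT_EARLY_IN_LXC, /* tail-call from the bpf_lxc reply path */
	REVDNAT_LATER_HOOK,   /* later hook, still prior to encryption */
};

/* The reply path is unchanged: entries that already carry the flag
 * keep taking the early path. */
static enum revdnat_hook_sketch
reply_revdnat_hook(const struct ct_entry_sketch *ct)
{
	return ct->node_port ? REVDNAT_EARLY_IN_LXC : REVDNAT_LATER_HOOK;
}
```

In this model a CT entry created before the change (flag set) still resolves to the early path, while entries created afterwards never get the flag, which matches the statement that existing connections are handled as before.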
/test
@jschwinger233 👋 maybe you want to do the honors on this one? It's taken long enough 😄
jschwinger233 approved these changes on Sep 3, 2025
My honor 😆
Labels
area/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
area/kpr
Anything related to our kube-proxy replacement.
area/loadbalancing
Impacts load-balancing and Kubernetes service implementations
feature/ipsec
Relates to Cilium's IPsec feature
release-note/misc
This PR makes changes that have no direct user impact.