-
Notifications
You must be signed in to change notification settings - Fork 3.4k
[IPsec] Enable VXLAN-in-ESP traffic #36345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The 1st commit deletes ipsec code from lxc_bpf, which breaks |
51cb7a8
to
d49a880
Compare
@jschwinger233 I collapsed the relevant commits. I had them separated out more for testing purposes. They can be merged together. |
d49a880
to
ad50311
Compare
I think all those Otherwise we'll do double ipsec encryption: lxc -> L7 proxy -> bpf_host -> xfrm encryption (1st) -> ... -> to_netdev -> xfrm encryption (2nd) 😁 The from_proxy + ipsec code in bpf_host is to handle ipsec + L7 proxy. The classic path is lxc -> stack (with ipsec mark) -> xfrm encryption -> cilium_host -> ..., but with L7 proxy, the path will be lxc -> stack (with tproxy mark) -> tproxy hijack -> L7 proxy -> cilium_host -> stack (with ipsec mark) -> xfrm encryption -> ...; the |
@jschwinger233 Thank you! I was not too familiar with the l7 proxy flow. So if I understand correctly, when a packet is destined for the l7 proxy, we go out the pod with a tproxy mark, and get delivered to the l7 proxy socket, after l7 proxy processing, the l7 proxy sets the If that is the case, I think I understand. When the Do I have that all correct? |
Lots of failing eBPF unit tests now that IPsec hooks are not in bpf_lxc, going to begin tackling these now. At first blush I think we can simply remove them and replace with a test at bpf_host. |
Correct! |
ad50311
to
8f1673b
Compare
Pod-to-pod encryption connectivity tests are failing. I haven't dug deeply but I would expect this. The tests are most likely expecting to see ESP traffic after VXLAN decap, but the packets are in plain text now after VXLAN decap since [ESP] -> [VXLAN] -> [Plain Text] is the process for VinE traffic. |
8f1673b
to
381e815
Compare
Latest force push includes new eBPF integration and unit tests for the native device egress hooks. I will begin looking at the Cilium CLI connectivity test updates next. |
e670b2b
to
7bce967
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
took a quick pass, some comments inline
if (!eth_is_supported_ethertype(proto)) | ||
return DROP_UNSUPPORTED_L2; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're already receiving the this didn't make sense, I was misreading the helperproto
as parameter :).
I think we should simply let such traffic pass through, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this little bit of logic was ripped right from wireguard.h
. Is it incorrect there as well?
/* See IPv4 comment. */ | ||
if (from_proxy && info->tunnel_endpoint && encrypt_key) | ||
return set_ipsec_encrypt(ctx, encrypt_key, info->tunnel_endpoint, | ||
info->sec_identity, true, false); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in the packet path for native-routing, connection was hijacked by the proxy
. Where would the encryption be applied now, in to-netdev
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the discussion from this comment on, answer your question? #36345 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, then we need to satisfy the ordering requirement that to-netdev
needs to be regenerated before cilium-host
is regenerated.
Re double-encryption - I think that's a problem we need to handle gracefully from within to-netdev
either way. There will be packets in-flight inside the node that are either plain ESP (native routing), or ESP-in-VXLAN (overlay routing). The new logic in to-netdev
needs to tolerate such traffic for one release cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to-netdev will see:
Vxlan routing:
- plain-text: host-to-host traffic
- plain-text-in-Vxlan: pod-to-pod traffic that requires further encryption
- ESP-in-Vxlan: old lingering encryption traffic
native routing:
- plain-text: pod-to-pod or host-to-any, that may require encryption
- ESP: old lingering traffic
if (ctx_is_overlay(ctx)) { | ||
ret = set_ipsec_encrypt(ctx, 0, ip4->daddr, | ||
get_identity(ctx), false, | ||
true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I can't find a counterexample in my mind: is it a correct assumption that all pod-to-pod traffic will be encapped to vxlan in a vxlan+ipsec cilium?
Because in theory we should lookup ipcache to determine if the original packets are pod2pod, but now we go for this shortcut. Let's make sure we don't overlook any cornor case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when ipsec + egress gateway, to-netdev can see a vxlan skb but we shouldn't do ipsec encryption in this case, am I correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are correct since EGW utilizes VXLAN in its implementation. We may want to ask more specifically with someone who is intimate with EGW, but that seems right to me. We could be in VXLAN for EGW and I do not believe that traffic which is utilizing VXLAN for EGW purposes is subjected to encryption. We should confirm tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that we have access to the source security ID for tunnel packets, as bpf_overlay
sets the src ID and a FROM_OVERLAY host mark. But I'm not sure if this helps, since we'd need some idea that the traffic is for EGW purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to-netdev
will see quite a bunch of VXLAN traffic that's not perfectly associated with pod-to-pod traffic (EGW, DSR-Geneve, North/South LB-to-backend, ...). WG took a pragmatic approach to this, and it resulted in this nice list of cases 😁.
I wouldn't give this too much thought for now - if our encryption path can transparently handle such traffic, it's probably not worth adding all sorts of special "exclude from IPsec" detection for it.
As an example for EGW: on the GW node, this traffic would still end up in the from-overlay
program after decryption, and receive its EGW handling there same as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the idea that "if our encryption path can transparently handle such traffic", but I was concerned about unexpected behavior changes: e.g. customers may assume egw won't be affected by cilium upgrade, but in fact it could: previously egw vxlan is plain-text, after upgrade it may be ipsec encrypted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. I think there we are lucky that most (all?) of these "unexpected" cases are associated with ENABLE_NODEPORT
, which is still a very new feature in combination with ENABLE_IPSEC
.
To fully support Encrypted Overlay and moving IPsec encrypt hooks to the egress device we must ensure the `to-netdev` program exists on native devices. Update the `DaemonConfig.AreDevicesRequired` function to return `true` when IPsec is enabled. This is a more general option which encompasses the `EnableIPSecEncryptedOverlay` flag. Moving forward Encrypted Overlay will be the default and this flag will be removed. Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
193fbbd
to
f1f7758
Compare
Latest push has both tunnel and native routing working including IPv6. |
78402db
to
2846d42
Compare
Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
Remove the tests which are no longer valid for IPsec. This includes the group of tests which focused on bpf_lxc and encrypted overlay. IPsec encryption hooks no longer run at bpf_lxc and encrypted overlay is deprecated. Add two new tests: 1. Integration test with bpf_host which ensures the IPsec hooks are called for traffic about to leave the host 2. Unit tests for the IPsec encryption hook and its hairpin functionality. The latter ensures the `ctx` is prepared correctly for the hairpin and a call to ctx_redirect is made to recirculate the packet into the stack for XFRM processing Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
IPsec VinE traffic can follow the same testing patterns as Wireguard sans the Host-to-Host encryption parts. Update the encryption connectivity to perform a more abstract "Encrypted Overlay" check. This is an general term which expresses that traffic is first tunnel encapsulated and then it is encrypted. With IPsec's move to VinE traffic this means both WG and IPsec support encrypted overlay past specific versions of Cilium. If the check is true then tests will sniff on the native device and expect to see no plain tunnel packets. Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
2846d42
to
d856b43
Compare
Latest push now hairpins the packet back into the host at |
Connectivity tests for l7 pod2pod encryption, in native routing, are failing. Tunnel mode is fine. |
This pull request has been automatically marked as stale because it |
This pull request has not seen any activity since it was marked stale. |
This pull request moves the IPsec encryption hooks from pod egress to native device egress.
To accomplish this we must hairpin the traffic leaving the host back into the network stack.
When the traffic hairpins it is processed by XFRM and encryption takes place.
Once the encryption occurs the packet is recirculated back into the stack and stack routing will push it out the proper interface.
This achieves the goal of encrypting VXLAN traffic, in other words, the datapath will now send VXLAN-in-ESP traffic on the wire.