Skip to content

Conversation

ldelossa
Copy link
Contributor

@ldelossa ldelossa commented Dec 4, 2024

This pull request moves the IPsec encryption hooks from pod egress to native device egress.

To accomplish this we must hairpin the traffic leaving the host back into the network stack.
When the traffic hairpins it is processed by XFRM and encryption takes place.
Once the encryption occurs the packet is recirculated back into the stack and stack routing will push it out the proper interface.

This achieves the goal of encrypting VXLAN traffic, in other words, the datapath will now send VXLAN-in-ESP traffic on the wire.

Enable VXLAN-in-ESP traffic for IPsec tunnels 

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 4, 2024
@ldelossa ldelossa added dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Dec 4, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 4, 2024
@jschwinger233
Copy link
Member

The 1st commit deletes ipsec code from lxc_bpf, which breaks git bisect I think? Can we add new ipsec code before deleting old code?

@ldelossa ldelossa force-pushed the ldelossa/ipsec-vxlan-in-esp branch from 51cb7a8 to d49a880 Compare December 4, 2024 14:26
@ldelossa
Copy link
Contributor Author

ldelossa commented Dec 4, 2024

@jschwinger233 I collapsed the relevant commits. I had them separated out more for testing purposes. They can be merged together.

@ldelossa ldelossa force-pushed the ldelossa/ipsec-vxlan-in-esp branch from d49a880 to ad50311 Compare December 4, 2024 16:09
@jschwinger233
Copy link
Member

jschwinger233 commented Dec 5, 2024

I think all those from_proxy things in bpf_host must be removed as well: https://github.com/cilium/cilium/blob/v1.17.0-pre.3/bpf/bpf_host.c

Otherwise we'll do double ipsec encryption: lxc -> L7 proxy -> bpf_host -> xfrm encryption (1st) -> ... -> to_netdev -> xfrm encryption (2nd) 😁

The from_proxy + ipsec code in bpf_host is to handle ipsec + L7 proxy. The classic path is lxc -> stack (with ipsec mark) -> xfrm encryption -> cilium_host -> ..., but with L7 proxy, the path will be lxc -> stack (with tproxy mark) -> tproxy hijack -> L7 proxy -> cilium_host -> stack (with ipsec mark) -> xfrm encryption -> ...; the from_proxy logic in bpf_host is to handle "L7 proxy -> cilium_host -> stack (with ipsec mark)" part.

@ldelossa
Copy link
Contributor Author

ldelossa commented Dec 5, 2024

@jschwinger233 Thank you! I was not too familiar with the l7 proxy flow.

So if I understand correctly, when a packet is destined for the l7 proxy, we go out the pod with a tproxy mark, and get delivered to the l7 proxy socket, after l7 proxy processing, the l7 proxy sets the from_proxy mark and I'm assuming some ip rule is setup to route that mark to bpf_host?

If that is the case, I think I understand. When the from_proxy bits get to bpf_host, we can simply remove the ipsec hooks for this case. When we do this bpf_host will do the next appropriate action (whether its redirect to tunnel device, or redirect to egress) and our new IPsec hooks at native device egress will handle the IPsec encryption bits from there.

Do I have that all correct?

@ldelossa
Copy link
Contributor Author

ldelossa commented Dec 5, 2024

Lots of failing eBPF unit tests now that IPsec hooks are not in bpf_lxc, going to begin tackling these now. At first blush I think we can simply remove them and replace with a test at bpf_host.

@jschwinger233
Copy link
Member

@jschwinger233 Thank you! I was not too familiar with the l7 proxy flow.

So if I understand correctly, when a packet is destined for the l7 proxy, we go out the pod with a tproxy mark, and get delivered to the l7 proxy socket, after l7 proxy processing, the l7 proxy sets the from_proxy mark and I'm assuming some ip rule is setup to route that mark to bpf_host?

If that is the case, I think I understand. When the from_proxy bits get to bpf_host, we can simply remove the ipsec hooks for this case. When we do this bpf_host will do the next appropriate action (whether its redirect to tunnel device, or redirect to egress) and our new IPsec hooks at native device egress will handle the IPsec encryption bits from there.

Do I have that all correct?

Correct!

@ldelossa ldelossa force-pushed the ldelossa/ipsec-vxlan-in-esp branch from ad50311 to 8f1673b Compare December 6, 2024 16:21
@ldelossa
Copy link
Contributor Author

ldelossa commented Dec 6, 2024

📋 Test Report [cilium-test-1]
❌ 2/64 tests failed (2/1021 actions), 44 tests skipped, 2 scenarios skipped:
Test [pod-to-pod-encryption]:
  ❌ pod-to-pod-encryption/pod-to-pod-encryption/curl-ipv4: cilium-test-1/client-7b7776c86b-6zxfp (10.244.1.148) -> cilium-test-1/echo-other-node-7c95b6845f-zn5z4 (10.244.2.138:8080)
Test [pod-to-pod-with-l7-policy-encryption]:
  ❌ pod-to-pod-with-l7-policy-encryption/pod-to-pod-encryption/curl-ipv4: cilium-test-1/client-7b7776c86b-6zxfp (10.244.1.148) -> cilium-test-1/echo-other-node-7c95b6845f-zn5z4 (10.244.2.138:8080)

Pod-to-pod encryption connectivity tests are failing. I haven't dug deeply but I would expect this. The tests are most likely expecting to see ESP traffic after VXLAN decap, but the packets are in plain text now after VXLAN decap since [ESP] -> [VXLAN] -> [Plain Text] is the process for VinE traffic.

@ldelossa ldelossa force-pushed the ldelossa/ipsec-vxlan-in-esp branch from 8f1673b to 381e815 Compare December 6, 2024 23:51
@ldelossa
Copy link
Contributor Author

ldelossa commented Dec 6, 2024

Latest force push includes new eBPF integration and unit tests for the native device egress hooks.

I will begin looking at the Cilium CLI connectivity test updates next.

@ldelossa ldelossa force-pushed the ldelossa/ipsec-vxlan-in-esp branch 3 times, most recently from e670b2b to 7bce967 Compare December 8, 2024 03:43
@julianwiedmann julianwiedmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. feature/ipsec Relates to Cilium's IPsec feature labels Dec 10, 2024
Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took a quick pass, some comments inline

Comment on lines +220 to +240
if (!eth_is_supported_ethertype(proto))
return DROP_UNSUPPORTED_L2;

Copy link
Member

@julianwiedmann julianwiedmann Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're already receiving the proto as parameter :). this didn't make sense, I was misreading the helper

I think we should simply let such traffic pass through, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this little bit of logic was ripped right from wireguard.h. Is it incorrect there as well?

Comment on lines -403 to -410
/* See IPv4 comment. */
if (from_proxy && info->tunnel_endpoint && encrypt_key)
return set_ipsec_encrypt(ctx, encrypt_key, info->tunnel_endpoint,
info->sec_identity, true, false);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the packet path for native-routing, connection was hijacked by the proxy. Where would the encryption be applied now, in to-netdev?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the discussion from this comment on, answer your question? #36345 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then we need to satisfy the ordering requirement that to-netdev needs to be regenerated before cilium-host is regenerated.

Re double-encryption - I think that's a problem we need to handle gracefully from within to-netdev either way. There will be packets in-flight inside the node that are either plain ESP (native routing), or ESP-in-VXLAN (overlay routing). The new logic in to-netdev needs to tolerate such traffic for one release cycle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to-netdev will see:

Vxlan routing:

  • plain-text: host-to-host traffic
  • plain-text-in-Vxlan: pod-to-pod traffic that requires further encryption
  • ESP-in-Vxlan: old lingering encryption traffic

native routing:

  • plain-text: pod-to-pod or host-to-any, that may require encryption
  • ESP: old lingering traffic

if (ctx_is_overlay(ctx)) {
ret = set_ipsec_encrypt(ctx, 0, ip4->daddr,
get_identity(ctx), false,
true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I can't find a counterexample in my mind: is it a correct assumption that all pod-to-pod traffic will be encapped to vxlan in a vxlan+ipsec cilium?
Because in theory we should lookup ipcache to determine if the original packets are pod2pod, but now we go for this shortcut. Let's make sure we don't overlook any cornor case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when ipsec + egress gateway, to-netdev can see a vxlan skb but we shouldn't do ipsec encryption in this case, am I correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are correct since EGW utilizes VXLAN in its implementation. We may want to ask more specifically with someone who is intimate with EGW, but that seems right to me. We could be in VXLAN for EGW and I do not believe that traffic which is utilizing VXLAN for EGW purposes is subjected to encryption. We should confirm tho.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we have access to the source security ID for tunnel packets, as bpf_overlay sets the src ID and a FROM_OVERLAY host mark. But I'm not sure if this helps, since we'd need some idea that the traffic is for EGW purposes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to-netdev will see quite a bunch of VXLAN traffic that's not perfectly associated with pod-to-pod traffic (EGW, DSR-Geneve, North/South LB-to-backend, ...). WG took a pragmatic approach to this, and it resulted in this nice list of cases 😁.

I wouldn't give this too much thought for now - if our encryption path can transparently handle such traffic, it's probably not worth adding all sorts of special "exclude from IPsec" detection for it.

As an example for EGW: on the GW node, this traffic would still end up in the from-overlay program after decryption, and receive its EGW handling there same as before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the idea that "if our encryption path can transparently handle such traffic", but I was concerned about unexpected behavior changes: e.g. customers may assume egw won't be affected by cilium upgrade, but in fact it could: previously egw vxlan is plain-text, after upgrade it may be ipsec encrypted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. I think there we are lucky that most (all?) of these "unexpected" cases are associated with ENABLE_NODEPORT, which is still a very new feature in combination with ENABLE_IPSEC.

To fully support Encrypted Overlay and moving IPsec encrypt hooks to the
egress device we must ensure the `to-netdev` program exists on native
devices.

Update the `DaemonConfig.AreDevicesRequired` function to return `true`
when IPsec is enabled.

This is a more general option which encompasses the `EnableIPSecEncryptedOverlay`
flag.

Moving forward Encrypted Overlay will be the default and this flag will
be removed.

Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
@ldelossa ldelossa force-pushed the ldelossa/ipsec-vxlan-in-esp branch from 193fbbd to f1f7758 Compare December 13, 2024 01:51
@ldelossa
Copy link
Contributor Author

Latest push has both tunnel and native routing working including IPv6.

@ldelossa ldelossa force-pushed the ldelossa/ipsec-vxlan-in-esp branch 2 times, most recently from 78402db to 2846d42 Compare December 16, 2024 18:33
Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
Remove the tests which are no longer valid for IPsec.

This includes the group of tests which focused on bpf_lxc and encrypted
overlay.

IPsec encryption hooks no longer run at bpf_lxc and encrypted overlay is
deprecated.

Add two new tests:
1. Integration test with bpf_host which ensures the IPsec hooks are
   called for traffic about to leave the host
2. Unit tests for the IPsec encryption hook and its hairpin
   functionality.

The latter ensures the `ctx` is prepared correctly for the hairpin and
a call to ctx_redirect is made to recirculate the packet into the stack
for XFRM processing

Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
IPsec VinE traffic can follow the same testing patterns as Wireguard
sans the Host-to-Host encryption parts.

Update the encryption connectivity to perform a more abstract "Encrypted
Overlay" check. This is an general term which expresses that traffic is first
tunnel encapsulated and then it is encrypted. With IPsec's move to VinE
traffic this means both WG and IPsec support encrypted overlay past
specific versions of Cilium. If the check is true then tests will sniff
on the native device and expect to see no plain tunnel packets.

Signed-off-by: Louis DeLosSantos <louis.delos@isovalent.com>
@ldelossa ldelossa force-pushed the ldelossa/ipsec-vxlan-in-esp branch from 2846d42 to d856b43 Compare December 17, 2024 15:54
@ldelossa
Copy link
Contributor Author

ldelossa commented Dec 17, 2024

Latest push now hairpins the packet back into the host at ingress@cilium_net.
This also means we can remove the commit which toggles accept_local sysctl for native devices when IPsec was enabled.

@ldelossa
Copy link
Contributor Author

Connectivity tests for l7 pod2pod encryption, in native routing, are failing. Tunnel mode is fine.
Going to investigate this next...

@aanm aanm removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Dec 18, 2024
Copy link

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jan 18, 2025
Copy link

github-actions bot commented Feb 1, 2025

This pull request has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this Feb 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. feature/ipsec Relates to Cilium's IPsec feature release-note/minor This PR changes functionality that users may find relevant to operating Cilium. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants