
Conversation

gyutaeb
Contributor

@gyutaeb gyutaeb commented Jun 10, 2024

This is in the context of the lookup ipip tunnel feature described at #33020.

This pull request extends the datapath so that it can decapsulate IPIP tunnel traffic when the inner IP header and L4 tuple match a kube-proxy service.
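For clarity, below is a minimal sketch of the intended flow. It is not the actual code in this PR: decap_ipip_sketch() and inner_tuple_matches_service() are hypothetical names, the service check is stubbed out, and plain IPv4-in-IPv4 without IP options is assumed. The idea is: parse the outer header, match the inner tuple, and strip the outer header so the inner packet continues through the normal service path.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical stand-in for a kube-proxy service lookup on the inner tuple.
 * A real implementation would parse the inner L4 header and consult the
 * service map; here it only bounds-checks the inner IP header. */
static __always_inline int
inner_tuple_matches_service(struct iphdr *inner, void *data_end)
{
	return (void *)(inner + 1) <= data_end;
}

SEC("tc")
int decap_ipip_sketch(struct __sk_buff *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *outer, *inner;

	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	outer = (struct iphdr *)(eth + 1);
	if ((void *)(outer + 1) > data_end || outer->protocol != IPPROTO_IPIP)
		return TC_ACT_OK;

	/* Assume no IP options on the outer header for this sketch. */
	inner = outer + 1;
	if (!inner_tuple_matches_service(inner, data_end))
		return TC_ACT_OK;

	/* Strip the outer IPv4 header so the inner packet continues through
	 * the regular service handling behind this hook. */
	if (bpf_skb_adjust_room(ctx, -(int)sizeof(struct iphdr),
				BPF_ADJ_ROOM_MAC, 0))
		return TC_ACT_SHOT;

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

In this PR the corresponding logic sits behind the from-netdev hook; the sketch is only meant to illustrate the parse, match, and strip sequence.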

@gyutaeb gyutaeb requested review from a team as code owners June 10, 2024 16:21
@gyutaeb gyutaeb requested review from lmb and gentoo-root June 10, 2024 16:21
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 10, 2024
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jun 10, 2024
@gyutaeb gyutaeb changed the title from "bpf: Decapsulate IPIP tunnel traffic for kube-proxy" to "Lookup IPIP tunnel: Chapter 1" on Jun 10, 2024
Contributor

@lmb lmb left a comment


Loader changes seem benign. Seems like the PR is missing tests though?

@lmb lmb added the area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Jun 11, 2024
@gyutaeb
Contributor Author

gyutaeb commented Jun 11, 2024

@lmb Thank you for the review! Could you please clarify what you mean by "tests": the results of the existing tests, or new test code for this PR?

In case you meant new test code, I haven't added any for this PR yet. I've split the work into chapters, and this PR is chapter 1; I was going to add the BPF test code in the next chapter. Would that be okay?

@lmb
Contributor

lmb commented Jun 11, 2024

I was going to add the bpf test code in the next chapter. Would that be okay?

How can we tell that the code we're merging is actually correct? Having tests is a requirement.

@gyutaeb
Contributor Author

gyutaeb commented Jun 12, 2024

@lmb Okay. I am going to add test code to this PR. Thanks for letting me know :)

@gyutaeb gyutaeb force-pushed the enable-lookup-ipip branch 2 times, most recently from 4e99c55 to 03774eb, on June 12, 2024 07:44
@gyutaeb
Contributor Author

gyutaeb commented Jun 12, 2024

Correcting author & committer in 03774eb

@borkmann
Member

Hello, we do have --enable-ipip-termination in the agent already. Could you piggy-back on this instead of reimplementing? (#31213)

@gyutaeb
Contributor Author

gyutaeb commented Jun 13, 2024

Thank you @borkmann. I checked #31213. It seems to only handle the configuration option, so I am going to use c.EnableIPIPTermination to control the datapath feature in this PR.

c.EnableIPIPTermination = vp.GetBool(EnableIPIPTermination)
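On the datapath side, I expect this option to eventually surface as a compile-time define passed when the BPF objects are built. A rough, self-contained illustration only; the define name ENABLE_IPIP_TERMINATION and both function names below are placeholders, not code from #31213 or this PR:

/* Build with -DENABLE_IPIP_TERMINATION to compile the feature in; without
 * the define, the decap branch is absent from the object entirely. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

static __always_inline int maybe_decap_ipip(struct __sk_buff *ctx)
{
	(void)ctx;
#ifdef ENABLE_IPIP_TERMINATION
	/* The real decapsulation logic (see the sketch in the PR
	 * description) would run here and could return a drop verdict. */
#endif
	return TC_ACT_OK;
}

SEC("tc")
int from_netdev_sketch(struct __sk_buff *ctx)
{
	int ret = maybe_decap_ipip(ctx);

	if (ret != TC_ACT_OK)
		return ret;
	/* ... regular service handling would continue here ... */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";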

@gyutaeb gyutaeb force-pushed the enable-lookup-ipip branch 3 times, most recently from 657e269 to 3ada7b9, on June 17, 2024 15:09
@gyutaeb
Contributor Author

gyutaeb commented Jun 17, 2024

@lmb I've added the test code in 3ada7b9. Could you review the test code and the PR?

@lmb
Contributor

lmb commented Jun 18, 2024

I've already reviewed the part of the code I'm familiar with. @gentoo-root is responsible for the overall review here.

@YakhontovYaroslav

YakhontovYaroslav commented Jun 19, 2024

It turns out there is a similar issue already open, #32473, describing a similar regression in the kube-proxy replacement.
Also, #30547 from @borkmann, which solves much the same problem but already handles IPv6 packets as well, remains unmerged.
I'm now investigating how to make the flow I've described in #32473 (comment) work, and it would be nice if we could merge the work done in both PRs so that both flows become possible.

@gyutaeb
Contributor Author

gyutaeb commented Jun 20, 2024

I believe #30547 from @borkmann and this PR are similar but have structural differences.

#30547 checks the inner tuple in nodeport_lb4() and performs IPIP termination there, whereas this PR only inspects packets coming from-netdev.

Additionally, #30547 seems to support DSR for pods, where IPIP-terminated packets reach the pod and the pod performs direct server return. In contrast, this PR does not involve the pod's direct server return; IPIP-terminated packets are NATed and unNATed by the kube-proxy logic.

@gyutaeb gyutaeb force-pushed the enable-lookup-ipip branch 2 times, most recently from 17adf0a to 8577905, on June 20, 2024 13:14
@gyutaeb gyutaeb changed the title from "Lookup IPIP tunnel: Chapter 1" to "IPIP Termination: Chapter 1" on Jun 20, 2024
@gyutaeb
Contributor Author

gyutaeb commented Jun 20, 2024

Hello, we do have --enable-ipip-termination in the agent already. Could you piggy-back on this instead of reimplementing? (#31213)

I've modified the C define and file names to piggy-back on --enable-ipip-termination (#31213). Thank you, @borkmann.

@gyutaeb
Contributor Author

gyutaeb commented Jun 21, 2024

@gentoo-root ping? :) Could you review this PR? 🙏

@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Sep 5, 2024
@gyutaeb
Contributor Author

gyutaeb commented Sep 5, 2024

@joestringer Could you please run /test again? I've rebased the PR branch to facilitate a review.

@joestringer
Member

/test

return DROP_INVALID;
}

return CTX_ACT_OK;
Member


This decapsulation occurs when the inner IP header and L4 tuple match the kube-proxy services. The goal of this decapsulation is that Cilium's kube-proxy logic can handle IPIP tunnel traffic.

Sorry for the big delay/backlog on my side. What is the rationale for the above? I'm mainly asking because it means service traffic pays the service lookup cost twice in the fast path, first in decap_ipip() and then once again in the regular service handling code.

I'm still somewhat in favor of just attaching BPF to the cilium_ipip device and letting it sort out the rest of the service handling. It would decap any traffic that hits the node, and if some of it is service traffic we handle that too, similarly for other policies outside of services. Here we're basically just pushing these IPIP-encapped packets up the stack, and I wonder whether that is valid.

If the route similar to vxlan/geneve doesn't work out, maybe we could revisit, but what speaks against unconditionally decapping? Btw, from the kernel side I was looking into GRO/GSO handling. I think at least the GRO part is independent of whether a device is involved, as the GRO engine should just look at the headers, aggregate, and send the bigger packet up. For GSO I haven't checked yet; did you look into it? On the other hand, here we only terminate, we don't encap.

Contributor Author

@gyutaeb gyutaeb Sep 11, 2024


Thanks for the review, @borkmann :)

What is the rationale for the above? I'm mainly asking as this means service traffic needs to do the service lookup cost twice in the fast path, first in decap_ipip() and then later on once again in the regular service handling code.

I agree that decapsulation should work independently (no service lookup). The reason for doing the service lookup in decap_ipip() was defensive programming: it was meant to protect against cases where K8s cluster nodes are used not only for K8s but also for other purposes, for example when users run applications directly on the nodes and use tunnel devices. While this configuration is not ideal, the focus was on preventing potential conflicts as much as possible. (I realize this might be considering too many edge cases.)

I also looked into the kernel GRO part. Just as GRO is independent of the device, removing the dependency on the service lookup in decap_ipip() could be a good alternative. This approach would perform decapsulation unconditionally. How about improving the PR this way? It would function similarly to cilium_ipip, without dependencies on other parts.
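To make this concrete, the unconditional variant would drop the inner-tuple check from the sketch in the PR description and decapsulate based on the outer protocol alone. A hypothetical sketch, not the code in this PR:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int decap_ipip_unconditional(struct __sk_buff *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *outer;

	if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	outer = (struct iphdr *)(eth + 1);
	if ((void *)(outer + 1) > data_end || outer->protocol != IPPROTO_IPIP)
		return TC_ACT_OK;

	/* No inner-tuple/service match: every IPIP packet that reaches the
	 * node is decapped, and later stages decide what to do with it. */
	if (bpf_skb_adjust_room(ctx, -(int)sizeof(struct iphdr),
				BPF_ADJ_ROOM_MAC, 0))
		return TC_ACT_SHOT;

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";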

To summarize the overall plan, it is as follows:

  1. ENABLE_EXTLB_IPIP_TERMINATION supports IPIP decapsulation without a service lookup.
  2. Prepare to piggy-back on --enable-ipip-termination (cilium: Enable plain IPIP/IP6IP6 termination #31213).
    2.1. Implement the existing bpf_overlay hook so it can be used with tunnel devices.
    2.2. Modify cilium-agent so it can attach the new bpf_overlay.
  3. Deprecate item 1.

Moving over to --enable-ipip-termination while ENABLE_EXTLB_IPIP_TERMINATION is deprecated would be a good path forward. And if it's okay, I would like to continue working on piggy-backing on #31213.

For your reference, the IPv6 development has been completed. Since it is based on parts of this PR, it would be better to create a new PR once this PR is merged.

Contributor Author

@gyutaeb gyutaeb Sep 23, 2024


@borkmann @gentoo-root ping? 😄 Could you please review this conversation and approach? I'm always grateful, thank you :)

I also looked into the kernel GRO part. Just as GRO is independent of the device, removing the dependency on the service lookup in decap_ipip() could be a good alternative. This approach would perform decapsulation unconditionally. How about improving the PR this way? It would function similarly to cilium_ipip, without dependencies on other parts.

Member


Sorry for the long delay on my side, conference travels and too much backlog :/

Prepare to piggy-back on --enable-ipip-termination (#31213).
2.1. Implement the existing bpf_overlay hook so it can be used with tunnel devices.
2.2. Modify cilium-agent so it can attach the new bpf_overlay.

I think that sounds good, hopefully in the meantime you managed to do a PoC to see if this worked out?

@gyutaeb gyutaeb requested a review from borkmann September 27, 2024 11:35
@derailed
Contributor

@gentoo-root this is waiting for your approval. Could you take another look?

@joestringer
Member

I understood from discussion in the community meeting yesterday that the next step was for @gyutaeb to report back about the PoC in order to resolve the discussion in this thread. If @borkmann approves then we shouldn't need @gentoo-root to take another look.

@joestringer joestringer removed the request for review from gentoo-root October 24, 2024 18:01
@joestringer joestringer dismissed gentoo-root’s stale review October 24, 2024 18:02

No recent activity

@gyutaeb
Contributor Author

gyutaeb commented Oct 29, 2024

@borkmann @derailed @joestringer @gentoo-root
Thank you for the review and comments. I am currently working on a PoC for #31213: I am verifying the operation of bpf_host.o through the cilium_ipip interface ingress hook, which requires some modifications to the cilium-agent. I plan to complete this within the next few weeks and share the results with you as soon as possible.

@gyutaeb
Contributor Author

gyutaeb commented Nov 20, 2024

@borkmann Thank you for your patience. Let me share the PoC results. The tests were conducted on the stable version, v1.16.3.

An arbitrary tunl0:0 device was added, and the cil_from_netdev-tunl0 was injected into the ingress TC hook.

tunl0:0: flags=193<UP,RUNNING,NOARP>  mtu 1480
        inet 10.10.10.10  netmask 255.255.255.255
        tunnel   txqueuelen 1000  (IPIP Tunnel)

# tc filter show dev tunl0:0 ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_from_netdev-tunl0 direct-action not_in_hw id 2384 tag 9a04780ea6522a6c jited

This can be illustrated as follows:

+------------------------+------------------------+
| tunl0:0 (10.10.10.10)  | cil_from_netdev-tunl0  |
+------------------------+------------------------+
| eth0 (192.168.0.100)   | cil_from_netdev-eth0   |
+------------------------+------------------------+
outer: LB Real IP -> 192.168.0.100
inner: ClientIP   -> 10.10.10.10

As we anticipated, the IPIP device performs decapsulation, and the ingress TC hook that follows handles the service. I've confirmed successful end-to-end testing using IPIP external load balancers.

The original goal was to achieve the above results using #31213. However, an error occurred when attempting to attach the eBPF program to the cilium_ipip4 device created by --enable-ipip-termination. The following error occurs in reloadHostDatapath() in loader.go:

Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=466 error="loading eBPF collection into the kernel: inlining global data: can't override non-existent constant \"ETH_HLEN\"" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint

It seems likely that by modifying the code to pass an ebpf.CollectionSpec, I could achieve successful results with the cilium_ipip{4,6} devices created by #31213. However, since I have no prior experience working with the eBPF spec code, it might take me a few more weeks to complete the implementation.

Therefore, I would propose proceeding in the following order:

  1. Merge this PR: ENABLE_EXTLB_IPIP_TERMINATION is merely a build flag, which requires an explicit define to be used, and it will not be exposed as a configurable option. As a result, it does not affect other datapath logic while providing significant value to the Cilium users who have been waiting for this feature. (As we discussed, this feature performs only decapsulation and does not do a service lookup.)
  2. Continue working on the PoC mentioned above: enable the cil_from_netdev program to be loaded onto cilium_ipip{4,6}.
  3. Create a Helm value for --enable-ipip-termination and develop the related functionality and documentation.

@gyutaeb gyutaeb force-pushed the enable-lookup-ipip branch 3 times, most recently from eda6050 to 141ca3d, on November 22, 2024 10:58
@gyutaeb
Contributor Author

gyutaeb commented Nov 22, 2024

With 141ca3d above, this PR has been adjusted to perform only decapsulation, and it has been rebased. I've also clarified the commit message to explicitly state that this will be deprecated by #31213.

@gyutaeb
Contributor Author

gyutaeb commented Nov 22, 2024

@joestringer It's been a while since I last reached out. Could you please run /test again? I've made some improvements to the PR.

@gentoo-root
Contributor

/test

When ENABLE_EXTLB_IPIP_TERMINATION is used as a build flag,
cil_from_netdev() decapsulates IPIP tunnel traffic. The goal of this
decapsulation is that Cilium's kube-proxy logic can handle IPIP tunnel
traffic.

A test verifies that ENABLE_EXTLB_IPIP_TERMINATION works as expected.
It creates an IPIP tunnel packet between the load balancer and the node,
and verifies that the packet is correctly terminated in cil_from_netdev.

Note: This feature is expected to be deprecated in the future
by cilium#31213

Signed-off-by: Gyutae Bae <gyu.8ae@gmail.com>
@gyutaeb
Contributor Author

gyutaeb commented Nov 25, 2024

@gentoo-root @joestringer Sorry for the failed tests. I encountered an issue while validating the commit myself; I've cleaned up unused variables and made changes to ensure the tests pass. Whenever you have time, could you please run /test again?
Thank you for taking care of this.

@joestringer
Member

/test

@borkmann
Member

@borkmann Thank you for your patience. Let me share the PoC results. The tests were conducted on the stable version, v1.16.3.

An arbitrary tunl0:0 device was added, and the cil_from_netdev-tunl0 was injected into the ingress TC hook.

tunl0:0: flags=193<UP,RUNNING,NOARP>  mtu 1480
        inet 10.10.10.10  netmask 255.255.255.255
        tunnel   txqueuelen 1000  (IPIP Tunnel)

# tc filter show dev tunl0:0 ingress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 cil_from_netdev-tunl0 direct-action not_in_hw id 2384 tag 9a04780ea6522a6c jited

This can be illustrated as follows:

+------------------------+------------------------+
| tunl0:0 (10.10.10.10)  | cil_from_netdev-tunl0  |
+------------------------+------------------------+
| eth0 (192.168.0.100)   | cil_from_netdev-eth0   |
+------------------------+------------------------+
outer: LB Real IP -> 192.168.0.100
inner: ClientIP   -> 10.10.10.10

As we anticipated, the IPIP device performs decapsulation, and the ingress TC hook that follows handles the service. I've confirmed successful end-to-end testing using IPIP external load balancers.

Nice! So you are saying that this approach seems to work fine then.

The original goal was to achieve the above results using #31213. However, an error occurred when attempting to attach the eBPF program to the cilium_ipip4 device created by --enable-ipip-termination. The following error occurs in reloadHostDatapath() in loader.go:

Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=466 error="loading eBPF collection into the kernel: inlining global data: can't override non-existent constant \"ETH_HLEN\"" identity=1 ipv4= ipv6= k8sPodName=/ subsys=endpoint

Could you push the draft PR somewhere? I might not get to it this week, but potentially next week I can also take a look and help.

It seems likely that by modifying the code to pass an ebpf.CollectionSpec, I could achieve successful results with the cilium_ipip{4,6} devices created by #31213. However, since I have no prior experience working with the eBPF spec code, it might take me a few more weeks to complete the implementation.

Therefore, I would propose proceeding in the following order:

  1. Merge this PR: ENABLE_EXTLB_IPIP_TERMINATION is merely a build flag, which requires an explicit define to be used, and it will not be exposed as a configurable option. As a result, it does not affect other datapath logic while providing significant value to the Cilium users who have been waiting for this feature. (As we discussed, this feature performs only decapsulation and does not do a service lookup.)
  2. Continue working on the PoC mentioned above: enable the cil_from_netdev program to be loaded onto cilium_ipip{4,6}.
  3. Create a Helm value for --enable-ipip-termination and develop the related functionality and documentation.

So we would merge this PR and remove it again once the other approach lands? From the above it sounds like the approach with the ipip device is quite close to working, no?

@gyutaeb
Contributor Author

gyutaeb commented Nov 28, 2024

@borkmann

Could you push the draft PR somewhere? I might not get to it this week, but potentially next week I can also take a look and help.

When an arbitrary IPIP device (tunl0:0) is added in v1.16.3 and the cilium-agent is restarted, the reloadHostDatapath() device list includes tunl0:0, so no additional code modifications are required up to this PoC stage; in reloadHostDatapath(), cil_from_netdev-tunl0 is attached. The error can also be reproduced by using --enable-ipip-termination as a cilium-agent parameter. I'm glad to hear that you'll take a look.

This means we are adding this PR and removing it again once the other approach lands? From above it sounds like the one with ipip device seems quite close to be working, no?

It's true that significant progress has been made, but unexpected issues or edge cases might still arise, potentially delaying the timeline (something we often encounter). As an alternative, I suggested adding this PR now to benefit the Cilium users who are waiting for it, and removing it later. To make this clear, I've also included this explanation in the commit message.

Note: This feature is expected to be deprecated in the future
by #31213

Additionally, this PR includes datapath IPIP packet test code. It will serve as a valuable reference for other contributors.

In the coming weeks, I will have limited availability as I focus on handling my company's backlog. However, I will continue analyzing the eBPF spec code and will share any findings as soon as I uncover them. I would greatly appreciate it if you could positively consider merging this PR.

@joestringer
Member

It seems like we have alignment about how this feature should be written, that is, piggybacking on the ipip infrastructure. Given that, I see no need to merge a version that is not adhering to that vision. As an open source project we need to be constantly vigilant about the burden that we accrue around technical debt, and ensure that the functionality we are merging is going to be maintainable going forward. If this code is useful already as-is for your use case, then that's awesome! I would encourage you to continue to iterate and work with the reviewers to get the feature into a state that meets the quality bar we expect in Cilium.

@gyutaeb
Contributor Author

gyutaeb commented Dec 12, 2024

@borkmann @joestringer Thank you for the great feedback and for trusting me with the opportunity to piggy-back on this. I will work on resolving the mentioned eBPF spec code issue soon. As it's the end of the year and I'm handling some backlog, I would greatly appreciate your patience.

@borkmann
Member

@borkmann @joestringer Thank you for the great feedback and for trusting me with the opportunity to piggy-back on this. I will work on resolving the mentioned eBPF spec code issue soon. As it's the end of the year and I'm handling some backlog, I would greatly appreciate your patience.

Thanks a lot @gyutaeb, this all sounds good to me. I think we can probably close this one for now and start with a fresh PR. I'll try to see if I can also hack something together next week as a base if I find some cycles in between; will let you know.

@borkmann borkmann closed this Dec 12, 2024
@gyutaeb
Contributor Author

gyutaeb commented Jan 21, 2025

Hi, @borkmann @joestringer.
I'd like to share the progress of my development. I am currently analyzing why bpf_host.o is not being loaded onto cilium_ipip devices. The issue occurs because cilium_ipip is not classified as a native device, which prevents bpf_host.o from being loaded. To address this, I am working on modifying the loader. I will share the PR with you once it's ready.
