IPIP Termination: Chapter 1 #33026
Conversation
Loader changes seem benign. Seems like the PR is missing tests though?
@lmb Thank you for the review! Could you please clarify "tests": do you mean results from the existing tests, or new test code for this PR? If you meant new test code, I haven't added it for this PR yet. I've split the work into chapters, and this is chapter 1; I was going to add the BPF test code in the next chapter. Would that be okay?
How can we tell that the code we're merging is actually correct? Having tests is a requirement.
@lmb Okay. I am going to add test code to this PR. Thanks for letting me know :)
Force-pushed from 4e99c55 to 03774eb
Correcting author & committer in 03774eb
Hello, we do have `--enable-ipip-termination` in the agent already. Could you piggy-back on this instead of reimplementing? (#31213)
Force-pushed from 657e269 to 3ada7b9
I've already reviewed the part of the code I'm familiar with. @gentoo-root is responsible for the overall review here.
It turns out there is a similar issue already open, #32473, reporting a similar regression regarding kube-proxy replacement.
I believe the #30547 PR from @borkmann and this PR are similar but have structural differences. #30547 checks the inner tuple in … Additionally, #30547 seems to support DSR for pods, where IPIP-terminated packets reach the pod and the pod performs direct server return. In contrast, this PR does not involve the pod's direct server return: IPIP-terminated packets are NATed and un-NATed by the kube-proxy logic.
Force-pushed from 17adf0a to 8577905
@gentoo-root ping? :) Could you review this PR? 🙏
Force-pushed from ea02ddd to c4d9f5f
@joestringer Could you please run `/test`?
/test
```c
		return DROP_INVALID;
	}
	/* … */
	return CTX_ACT_OK;
```
> This decapsulation occurs when the inner IP header and L4 tuple match the kube-proxy services. The goal of this decapsulation is that Cilium's kube-proxy replacement can handle IPIP tunnel traffic.
Sorry for the big delay/backlog on my side. What is the rationale for the above? I'm mainly asking because it means service traffic pays the service lookup cost twice in the fast path: first in `decap_ipip()` and then once again in the regular service handling code.
I'm still somewhat in favor of just attaching BPF to the `cilium_ipip` device and letting it sort out the rest of the service handling. It would decap traffic that hits the node no matter what; if some of it is service traffic then we handle it too, and similarly for other policies outside of services. Here we're basically just pushing these IPIP-encapped packets up the stack, and I wonder whether that is valid.
If the similar route as with vxlan/geneve doesn't work out, maybe we could revisit. But also, what speaks against just unconditionally decapping? Btw, from the kernel side I was looking into GRO/GSO handling. I think at least the GRO part is independent of whether there's a device, as the GRO engine should just look at the headers, aggregate, and then send the bigger packet up. For GSO, I haven't checked yet; did you look into it? Otoh, here we also only terminate, not encap.
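(Editorial illustration: a minimal sketch of the gated-decap shape being questioned here, written as a tc-style BPF program. `lookup_service4`, `struct lb4_key`, and the return codes are simplified stand-ins, not the PR's or Cilium's actual definitions.)

```c
/* Sketch only: decapsulation gated on a service lookup. The decapped
 * packet re-enters the regular service path afterwards, so the lookup
 * effectively runs twice in the fast path. */
#include <stdbool.h>
#include <linux/bpf.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct lb4_key {            /* simplified stand-in for Cilium's LB key */
	__be32 address;
	__be16 dport;
	__u8   proto;
};

static __always_inline bool lookup_service4(const struct lb4_key *key)
{
	return false;       /* placeholder for the service-map lookup */
}

static __always_inline int decap_ipip(struct __sk_buff *ctx,
				      const struct lb4_key *inner_tuple)
{
	/* First lookup: only decapsulate if the inner tuple targets a service. */
	if (!lookup_service4(inner_tuple))
		return TC_ACT_OK;          /* not service traffic, leave as-is */

	/* Strip the outer IPv4 header (20 bytes, assuming no IP options). */
	if (bpf_skb_adjust_room(ctx, -(int)sizeof(struct iphdr),
				BPF_ADJ_ROOM_MAC, 0))
		return TC_ACT_SHOT;        /* simplified error handling */

	/* The inner packet continues into the regular service handling,
	 * which performs the lookup a second time. */
	return TC_ACT_OK;
}
```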
Thanks for the review, @borkmann :)
> What is the rationale for the above? I'm mainly asking as this means service traffic needs to do the service lookup cost twice in the fast path, first in `decap_ipip()` and then later on once again in the regular service handling code.
I agree that decapsulation should work independently (no service lookup). The reason for doing the service lookup in `decap_ipip()` was defensive programming. It was designed to protect against cases where K8s cluster nodes are used not only for K8s purposes but also for other uses, for example when users run applications directly on the nodes and use tunnel devices. While this configuration is not ideal, the focus was on preventing potential conflicts as much as possible. (I realize this might be considering too many edge cases.)
I also looked into the kernel GRO part. Just as GRO is independent of the device, removing the dependency on the service lookup in `decap_ipip()` could be a good alternative. This approach would perform decapsulation unconditionally. How about improving the PR this way? It would function similarly to `cilium_ipip`, without dependencies on other parts.
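(For contrast, a sketch of the unconditional variant, under the same assumptions as the sketch above; again a simplified stand-in, not actual Cilium code.)

```c
#include <linux/bpf.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: unconditional IPIP decapsulation, with no service lookup
 * in the decap path. The regular service handling downstream sees the
 * inner packet and decides, much like terminating on a cilium_ipip
 * device would. */
static __always_inline int decap_ipip_unconditional(struct __sk_buff *ctx)
{
	/* Strip the outer IPv4 header (20 bytes, assuming no IP options). */
	if (bpf_skb_adjust_room(ctx, -(int)sizeof(struct iphdr),
				BPF_ADJ_ROOM_MAC, 0))
		return TC_ACT_SHOT;   /* simplified error handling */

	return TC_ACT_OK;
}
```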
To summarize the overall plan:

1. `ENABLE_EXTLB_IPIP_TERMINATION` supports IPIP decapsulation without a service lookup.
2. Prepare to piggy-back on `--enable-ipip-termination` (cilium: Enable plain IPIP/IP6IP6 termination #31213).
    2.1. Implement the existing `bpf_overlay` hook to be usable with tunnel devices.
    2.2. Modify `cilium-agent` to allow it to attach the new `bpf_overlay`.
3. Deprecate item 1.
It would be a good way to move over to using `--enable-ipip-termination` while we deprecate `ENABLE_EXTLB_IPIP_TERMINATION`. And if it's okay, I would like to continue working on piggy-backing #31213.
For your reference, the development for IPv6 has been completed. Since the IPv6 work is based on part of this PR, it would be better to create a new PR once this PR is merged.
@borkmann @gentoo-root ping? 😄 Could you please review this conversation and approach? I'm always grateful, thank you :)
> I also looked into the kernel GRO part. Just as GRO is independent of the device, removing the dependency on the service lookup in `decap_ipip()` could be a good alternative. This approach would perform decapsulation unconditionally. How about improving the PR this way? It would function similarly to `cilium_ipip`, without dependencies on other parts.
Sorry for the long delay on my side, conference travels and too much backlog :/
> Preparing to piggy-back on `--enable-ipip-termination` (#31213).
> 2.1. Implement the existing `bpf_overlay` hook to be usable with tunnel devices.
> 2.2. Modify `cilium-agent` to allow it to attach the new `bpf_overlay`.
I think that sounds good, hopefully in the meantime you managed to do a PoC to see if this worked out?
@gentoo-root this is waiting for your approval. Could you take another look?
I understood from discussion in the community meeting yesterday that the next step was for @gyutaeb to report back about the PoC in order to resolve the discussion in this thread. If @borkmann approves then we shouldn't need @gentoo-root to take another look.
@borkmann @derailed @joestringer @gentoo-root
@borkmann Thank you for your patience. Let me share the PoC results. The tests were conducted on the stable version, v1.16.3. An arbitrary
This can be illustrated as follows:
As we anticipated, the IPIP device performs decapsulation, and the ingress TC hook that follows handles the service. I've confirmed successful end-to-end testing using IPIP external load balancers. The original goal was to achieve the above results using #31213. However, an error occurred when attempting to attach the eBPF program to the
It seems likely that by modifying the code to pass
Therefore, I would propose proceeding in the following order:
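(Editorial illustration of the PoC topology described above: a minimal tc ingress program as it might sit on the IPIP device. The program name and layout are hypothetical; the assumption is that the kernel device has already stripped the outer header, and that IPIP devices are L3, so no Ethernet header is present.)

```c
#include <linux/bpf.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: tc ingress on an IPIP device. The kernel has already
 * removed the outer IPv4 header, and the device is L3, so the packet
 * data starts at the inner IP header. */
SEC("tc")
int ipip_dev_ingress(struct __sk_buff *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct iphdr *inner = data;

	if ((void *)(inner + 1) > data_end)
		return TC_ACT_SHOT;

	/* Here the real datapath would run its usual service handling
	 * (NAT, conntrack, ...) directly on the inner tuple. */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```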
Force-pushed from eda6050 to 141ca3d
@joestringer It's been a while since I last reached out. Could you please run `/test`?
/test
When ENABLE_EXTLB_IPIP_TERMINATION is used as a build flag, cil_from_netdev() decapsulates IPIP tunnel traffic. The goal of this decapsulation is that Cilium's kube-proxy replacement can handle IPIP tunnel traffic. A test verifies that ENABLE_EXTLB_IPIP_TERMINATION works as expected: it creates an IPIP tunnel packet between the load balancer and the node, and verifies that the packet is correctly terminated in cil_from_netdev. Note: this feature is expected to be deprecated in the future by cilium#31213.
Signed-off-by: Gyutae Bae <gyu.8ae@gmail.com>
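(For reference, a rough sketch of the packet shape such a test builds, expressed as a plain C struct layout rather than the test suite's actual packet-building helpers; the field annotations are assumptions based on the commit message.)

```c
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <netinet/in.h>   /* IPPROTO_IPIP */

/* Sketch only: on-wire layout of an IPv4-in-IPv4 test frame from the
 * load balancer to the node, carrying a client->service TCP segment
 * as the inner packet. */
struct ipip_test_pkt {
	struct ethhdr eth;   /* LB MAC -> node MAC, h_proto = ETH_P_IP */
	struct iphdr  outer; /* saddr = LB, daddr = node, protocol = IPPROTO_IPIP */
	struct iphdr  inner; /* saddr = client, daddr = service VIP */
	struct tcphdr tcp;   /* dest = service port */
} __attribute__((packed));
```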
Force-pushed from 141ca3d to 0cf501c
@gentoo-root @joestringer Sorry for the failed tests. I encountered an issue while validating the commit myself. I've cleaned up unused variables and made changes to ensure the tests pass. Whenever you have time, could you please run `/test`?
/test
Nice! So you are saying that this approach seems to work fine then.
Could you push the draft PR somewhere? I might not get to it this week, but potentially next week I can also take a look and help.
This means we are adding this PR and removing it again once the other approach lands? From the above it sounds like the one with the IPIP device is quite close to working, no?
When an arbitrary IPIP device (
It's true that significant progress has been made, but unexpected issues or edge cases might still arise, potentially delaying the timeline (something we often encounter). As an alternative, I suggested adding this PR now to benefit the Cilium users who are waiting for it, and removing it later. To make this clear, I've also included this explanation in the commit message.
Additionally, this PR includes datapath IPIP packet test code, which will serve as a valuable reference for other contributors. In the coming weeks I will have limited availability as I focus on handling my company's backlog, but I will continue analyzing the eBPF spec code and will share any findings as soon as I uncover them. I would greatly appreciate it if you could positively consider merging this PR.
It seems like we have alignment about how this feature should be written, that is, piggybacking on the ipip infrastructure. Given that, I see no need to merge a version that is not adhering to that vision. As an open source project we need to be constantly vigilant about the burden that we accrue around technical debt, and ensure that the functionality we are merging is going to be maintainable going forward. If this code is useful already as-is for your use case, then that's awesome! I would encourage you to continue to iterate and work with the reviewers to get the feature into a state that meets the quality bar we expect in Cilium.
@borkmann @joestringer Thank you for the great feedback and for trusting me with the opportunity to piggy-back on this. I will work on resolving the mentioned eBPF spec code issue soon. As it's the end of the year and I'm handling some backlog, I would greatly appreciate your patience.
Thanks a lot @gyutaeb, this all sounds good to me. I think we can probably close this one here for now and start with a fresh PR. I'll try to see if I can also hack something together next week as a base if I find some cycles in between; will let you know.
Hi, @borkmann @joestringer. |
This is in the context of the lookup ipip tunnel feature described at #33020.

This pull request extends the datapath to be able to decapsulate the IPIP tunnel when the inner IP header and L4 tuple match the kube-proxy services.