Skip to content

Fix dropped NodePort traffic to hostNetwork backends with Geneve+DSR #36978

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

tommasopozzetti
Copy link
Contributor

@tommasopozzetti tommasopozzetti commented Jan 14, 2025

This PR fixes DSR dispatch with Geneve to hostNetworked pods by ensuring tunnel_endpoint is always populated in ipcache, even for directly reachable endpoints. Unneeded encapsulation is avoided by passing the skiptunnel flag for such endpoints.

Fixes: #36901

@tommasopozzetti tommasopozzetti requested review from a team as code owners January 14, 2025 21:13
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 14, 2025
@github-actions github-actions bot added cilium-cli This PR contains changes related with cilium-cli kind/community-contribution This was a contribution made by a community member. labels Jan 14, 2025
@julianwiedmann julianwiedmann self-requested a review January 15, 2025 07:43
@julianwiedmann julianwiedmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. area/loadbalancing Impacts load-balancing and Kubernetes service implementations feature/dsr Relates to Cilium's Direct-Server-Return feature for KPR. labels Jan 15, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 15, 2025
@julianwiedmann julianwiedmann added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 15, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 15, 2025
@julianwiedmann
Copy link
Member

Thank you! I'll have a closer look later.

Also queued up #36982. Once that's merged, we can rebase this PR so that we get immediate feedback on the connectivity test.

Copy link
Contributor

@Artyop Artyop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@julianwiedmann
Copy link
Member

Also queued up #36982. Once that's merged, we can rebase this PR so that we get immediate feedback on the connectivity test.

Merged, I'll auto-rebase the branch...

@julianwiedmann julianwiedmann force-pushed the pr/fix-nodeport-to-hostnetns-dsr-geneve branch from ce585d8 to aa75d28 Compare January 15, 2025 14:40
@julianwiedmann
Copy link
Member

/ci-e2e-upgrade

@tommasopozzetti
Copy link
Contributor Author

Thanks for the quick look!
I see some connectivity test cases failed waiting for the new pod that gets introduced though not 100% sure how to go check more details about those setups. Any pointer would be appreciated, first time going through the cilium CI process! 😄

@tommasopozzetti
Copy link
Contributor Author

@julianwiedmann fixed the conflict. Any tips on next steps to get this in?

Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments inline, before I forgot about the CLI compatibility aspect again ...

@julianwiedmann
Copy link
Member

@julianwiedmann fixed the conflict. Any tips on next steps to get this in?

In general - please rebase on-top of main, instead of merging main into your branch.

I see some connectivity test cases failed waiting for the new pod that gets introduced though not 100% sure how to go check more details about those setups.

IIRC the CLI tests for cluster-internal connectivity to all the nodeport services on startup, before running the connectivity tests. Is that what's failing? Let's re-run the workflow and find out.

@tommasopozzetti
Copy link
Contributor Author

@dylandreimerink Thanks! Done!
Hopefully now we can get it in before it lags too much behind again 😅

@dylandreimerink
Copy link
Member

/test

Copy link
Contributor

@doniacld doniacld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but asking peers from cilium/ipcache to have a second look regarding the addition of a new field to IPCache. cc @squeed

@doniacld
Copy link
Contributor

LGTM but asking peers from cilium/ipcache to have a second look regarding the addition of a new field to IPCache. cc @squeed

@squeed mentioned that's fine to add a new field to IPCache since we are mapping with the BPF side.

@joestringer joestringer added affects/v1.16 This issue affects v1.16 branch and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch cilium-cli This PR contains changes related with cilium-cli labels Feb 10, 2025
@joestringer
Copy link
Member

Thanks for the fix! Merging.

@joestringer joestringer added this pull request to the merge queue Feb 10, 2025
Merged via the queue into cilium:main with commit e8db5d3 Feb 10, 2025
66 checks passed
@nbusseneau nbusseneau mentioned this pull request Feb 14, 2025
22 tasks
@nbusseneau nbusseneau added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Feb 14, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Feb 20, 2025
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Mar 3, 2025
…onfig

With cilium#36978 the IPcache entry for
a remote node now contains a tunnel_endpoint, typically in combination
with the `skip_tunnel` flag set. But when dealing with replies for a
connection that was established via the overlay network, we want to
continue routing those replies via the overlay.

Also fix up the corresponding BPF test.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Mar 3, 2025
With cilium#36978 the IPcache entry for
a remote node now contains a tunnel_endpoint, typically in combination
with the `skip_tunnel` flag set. But when dealing with replies for a
connection that was established via the overlay network, we want to
continue routing those replies via the overlay.

Also fix up the corresponding BPF test.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Mar 9, 2025
This reproduces disruption for an established pod-to-remote-node
connection, caused by the changes in
cilium#36978.

In old code such a connection would be masqueraded, but in new code the
masquerading is skipped. Thus the destination will see requests with
different source IPs.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
@julianwiedmann
Copy link
Member

julianwiedmann commented Mar 9, 2025

Sorry, I think we'll have to revert this from v1.17. I believe it causes a regression, as described here. We can keep it in main for now if that helps.

@julianwiedmann
Copy link
Member

julianwiedmann commented Mar 9, 2025

I've queued up the revert here.

I actually like the approach of always setting the tunnel_endpoint, and thus associating an endpoint with its node's "primary" IP. I don't think we've had such an association before, and it feels beneficial. It even allows for a subsequent lookup in the node_map, and thus a decision whether two remote endpoints live on the same remote node.

But the way we currently handle pod-to-remote-node traffic (SNAT instead of tunnel), along with how the skip_tunnel flag was initially intended, makes this harder to implement. I'd be happy to give this a bit more thought & experimentation (post-PTO :)), I'm optimistic we can make it work and end up with a better overall design for pod-to-remote-node connections.

@tommasopozzetti
Copy link
Contributor Author

@julianwiedmann sorry to hear that!
Fixing the original issue here is still very much critical to us as without this fix there still is a significant issue with dropped traffic with that configuration!
I’d be happy to help tackle it differently in the meantime if you have suggestions or contribute to the solution any way I can

@tommasopozzetti
Copy link
Contributor Author

I’m also curious if you could expand (or point to docs) on the original intent of that flag
Reading through the code block you pointed out causes the issue with the pod to remote node traffic it’s not clear to me why the presence of the skip_tunnel flag would skip SNAT. Considering the current implementation never uses overlay tunneling for that communication to begin with, why would it need a different behavior depending on the value of that flag?

julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Mar 17, 2025
…nfig

With cilium#36978 the IPcache entry for
a remote node now contains a tunnel_endpoint, typically in combination
with the `skip_tunnel` flag set. But when dealing with replies for a
connection that was established via the overlay network, we want to
continue routing those replies via the overlay.

Also fix up the corresponding BPF test.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Mar 18, 2025
…unnel

With cilium#36978 the IPcache entry for
a remote node now contains a tunnel_endpoint, typically in combination
with the `skip_tunnel` flag set. But when dealing with replies for a
connection that was established via the overlay network, we want to
continue routing those replies via the overlay.

Also fix up the corresponding BPF test for inter-cluster-SNAT.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 18, 2025
…unnel

With #36978 the IPcache entry for
a remote node now contains a tunnel_endpoint, typically in combination
with the `skip_tunnel` flag set. But when dealing with replies for a
connection that was established via the overlay network, we want to
continue routing those replies via the overlay.

Also fix up the corresponding BPF test for inter-cluster-SNAT.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.16 This issue affects v1.16 branch area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/loadbalancing Impacts load-balancing and Kubernetes service implementations backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. feature/dsr Relates to Cilium's Direct-Server-Return feature for KPR. kind/community-contribution This was a contribution made by a community member. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cilium with geneve and DSR drops external traffic to host network backends