Skip to content

Conversation

PhilipSchmid
Copy link
Contributor

@PhilipSchmid PhilipSchmid commented Jan 6, 2025

Talos Linux's Forwarding kube-dns to Host DNS functionality doesn't work together with Cilium's BPF host routing (Linux kernel 5.10+ and kubeProxyReplacement=true and bpf.masquerade=true).

It might have worked unintentionally in some scenarios before Cilium 1.16.5, but with the fix from #35098, it stopped without also configuring bpf.hostLegacyRouting=true.

doc: Added hostLegacyRouting limitation for Talos

@PhilipSchmid PhilipSchmid requested a review from a team as a code owner January 6, 2025 20:43
@PhilipSchmid PhilipSchmid requested a review from a user January 6, 2025 20:43
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 6, 2025
Comment on lines 64 to 68
When using tunneling mode and enabling `Kube-Proxy replacement<kubeproxy-free>`
together with Talos' `Forwarding kube-dns to Host DNS`_ (enabled by default
since Talos 1.8+), you need to set ``bpf.hostLegacyRouting`` to ``true`` as
DNS won't work otherwise. This behaviour was introduced in Cilium 1.16.5+.
See `cilium/cilium#35098`_ for more details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we break this configuration in a newer release? Trying to understand whether this is something we should be documenting or fixing. Maybe @jschwinger233 has more context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#35098 fixed a bug been there since 1.14, it never meant to break anything, unless something counting on that bug 😬

I don't have much knowledge about cilium with talos, but I have a feeling that, what we really want to address is to support cilium-talos with bpf host routing. They worked together in the past accidentally based on a bug, now that the bug is gone, we should figure out what's missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the impression this is more about the "node-local DNS" use case than Talos in general.

Copy link
Member

@joestringer joestringer Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting the feeling from some of the reports around this issue that users were relying on the bug fixed by #35098, and the new bug now revealed is worse than the one you had fixed. Even if ultimately the decision is "users shouldn't be running bpf host routing yet because we're still working on stability in some cases", as a user it doesn't feel good if the project asks you to make some change to their configuration in a bugfix release because one of the other bugfixes actually broke the environment.

One good piece of news is that with this increased feedback, we are more likely to be able to create a reproducer test case to detect the new bug, and avoid regression in this area in future.

That said, if we think the current state is more severe than what we had before, it may be worth reverting #35098 to bring back the previous behavior and only roll out that fix in v1.17 along with changes to the recommended configuration to ensure that users don't trigger the newly discovered bug during upgrade (+ release note).

Of course if we think we have a good understanding of the latest report and can come up with a fix for that, then an alternative option is just to work towards that and apply that fix everywhere that we applied #35098.

@joestringer joestringer added release-note/misc This PR makes changes that have no direct user impact. needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Jan 7, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 7, 2025
@joestringer joestringer added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. and removed release-note/misc This PR makes changes that have no direct user impact. labels Jan 7, 2025
@dajrivera
Copy link

This issue is not limited to Talos: #36761

@PhilipSchmid PhilipSchmid changed the title doc: Added hostLegacyRouting warning for Talos doc: Added hostLegacyRouting limitation for Talos Jan 7, 2025
@PhilipSchmid PhilipSchmid force-pushed the pr/philip/hint_for_talos_installation branch from bfa40e6 to add9762 Compare January 7, 2025 20:01
@PhilipSchmid
Copy link
Contributor Author

Thanks all for the input!

@joestringer, @julianwiedmann: WDYT about the newest version where I only address the Talos-specific limitation of using "Forwarding kube-dns to Host DNS" together with Cilium's BPF host routing? bpf.masquerade is disabled by default and we don't recommend enabling it by default. Hence, I didn't add bpf.masquerade=true and bpf.hostLegacyRouting to the helm install example command. Because this is only for Talos and not node-local-dns, I also removed the "fixes 36737" thingy.

@julianwiedmann julianwiedmann added the needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch label Jan 8, 2025
Talos Linux's 'Forwarding kube-dns to Host DNS' functionality doesn't
work together with Cilium's BPF host routing.

Signed-off-by: Philip Schmid <phisch@cisco.com>
@PhilipSchmid PhilipSchmid force-pushed the pr/philip/hint_for_talos_installation branch from add9762 to 5db0821 Compare January 8, 2025 08:34
@julianwiedmann julianwiedmann added area/documentation Impacts the documentation, including textual changes, sphinx, or other doc generation code. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. labels Jan 8, 2025
Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thank you!

@julianwiedmann
Copy link
Member

/test

@joestringer joestringer enabled auto-merge January 8, 2025 18:16
@joestringer joestringer removed the request for review from a user January 8, 2025 18:17
@joestringer joestringer added this pull request to the merge queue Jan 8, 2025
Merged via the queue into cilium:main with commit 34e296e Jan 8, 2025
59 checks passed
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 8, 2025
@rastislavs rastislavs mentioned this pull request Jan 21, 2025
45 tasks
@rastislavs rastislavs added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 21, 2025
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. labels Jan 22, 2025
@rastislavs rastislavs mentioned this pull request Jan 22, 2025
19 tasks
@rastislavs rastislavs added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Jan 22, 2025
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/documentation Impacts the documentation, including textual changes, sphinx, or other doc generation code. backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants