Skip to content

Conversation

tsotne95
Copy link
Contributor

@tsotne95 tsotne95 commented Jun 17, 2025

On some cloud platforms like GCE, IPs for external Load Balancers are not assigned directly to a network interface.
Instead, they are added to the node's local routing table to make the kernel "own" the IP. The existing node
address detection logic, which only scanned device addresses, could not see these IPs, making them unavailable for
features like NodePort.

This commit enhances the node address detection logic to discover these IPs directly from the routing table,
making this behavior the default.

The node-address controller now inspects the local routing table for host-scoped routes. To ensure consistency,
an address from a route is only considered if the underlying device is suitable for node addressing. This device
filtering logic is consolidated in a new shouldUseDeviceForNodeAddress helper, which checks if a device is "up"
and "selected", while making specific exceptions for "dummy" devices (to support use cases like nodelocaldns)
and other whitelisted interfaces.

Key implementation details:

  • The logic avoids creating duplicate addresses if an IP from a route is already known from a device.
  • The NodePort eligibility for these route-based IPs is determined by the existing --nodeport-addresses
    whitelist.
  • If the whitelist is not set, route-based IPs are not marked for NodePort by default.

With this change, Cilium can now correctly use GCE Load Balancer IPs for NodePort services out-of-the-box.

test: Add new unit test TestNodeAddressFromRoute.

The new test validates the route-based discovery logic under several conditions, including when NodePort whitelists
match or don't match the discovered IP, and ensures no duplicates are created.

@tsotne95 tsotne95 requested review from a team as code owners June 17, 2025 12:36
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 17, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jun 17, 2025
@tsotne95 tsotne95 force-pushed the pr/fix-24481-NodePort branch 2 times, most recently from 365f5b7 to 75084a6 Compare June 17, 2025 12:55
@joestringer joestringer added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jun 20, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 20, 2025
@joestringer joestringer added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jun 20, 2025
@tsotne95 tsotne95 force-pushed the pr/fix-24481-NodePort branch from 75084a6 to 90d837c Compare June 24, 2025 18:00
@joestringer joestringer removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jul 1, 2025
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helm looks good to me! Thanks!

@pchaigno pchaigno requested a review from joamaki July 19, 2025 08:52
@pchaigno pchaigno enabled auto-merge July 19, 2025 08:53
@pchaigno pchaigno removed the request for review from gentoo-root July 19, 2025 08:54
auto-merge was automatically disabled July 21, 2025 08:01

Head branch was pushed to by a user without write access

@tsotne95 tsotne95 force-pushed the pr/fix-24481-NodePort branch from 90d837c to 79c559c Compare July 21, 2025 08:01
@joamaki
Copy link
Contributor

joamaki commented Jul 21, 2025

/test

@aanm aanm force-pushed the pr/fix-24481-NodePort branch from 3c8bdd2 to f71fcec Compare July 30, 2025 09:46
@tsotne95
Copy link
Contributor Author

/test

@tsotne95 tsotne95 force-pushed the pr/fix-24481-NodePort branch from f71fcec to 5ad0b13 Compare July 30, 2025 11:11
@tsotne95
Copy link
Contributor Author

/test

@tsotne95
Copy link
Contributor Author

tsotne95 commented Jul 30, 2025

same issues #40179 #31327 #40800

definitely, they weren't caused by my change.

1 test form ginkgo failed because #31040

@pchaigno pchaigno requested a review from joamaki July 30, 2025 15:09
@joestringer
Copy link
Member

@tsotne95 if you're referencing an issue that is more than a year old, especially if the issue is marked as "resolved", I would suggest filing a fresh issue for those failures. For instance you referred to #31040 . For that issue, even if the symptom is the same, we already applied a fix to the tree well over a year ago. Therefore if you're experiencing the problem, there may have been a regression, or it could be that something in this PR triggers the failure in ways that are not affecting the main branch. I understand that it can be frustrating to see a failure that appears to be unrelated to your PR, but in order to make that better for everyone, we aim to track down each failure case, prioritize, and fix those sources of unreliability. Referring to older closed issues will not push the testing reliability in the right direction.

@joestringer
Copy link
Member

Some more context here: Over the past couple of weeks I've been trying to help various contributors experiencing inconsistent test results, and naturally for any individual contributor it feels like the tests are giving poor feedback. If we can find the shortest path to ignore those failures it will unblock the progress on the individual PR. But in several cases we have merged PRs that increase the background failure rate, which then subsequently impact other PRs. If we follow the same pattern, we will constantly face the same problems. I'm aiming to set a higher bar for communication around these issues so that we face them head on and make a conscious decision how to handle them. If the test is unreliable on main, let's coordinate, investigate, come up with a proposal to improve it. If the test does not show evidence of unreliability on main then let's dig in on the PR, figure out what is making the test hard to debug, improve the output, improve the reliability of the test case to ensure it is validating the functionality in a reasonable way.

@tsotne95
Copy link
Contributor Author

tsotne95 commented Jul 30, 2025

if you're referencing an issue that is more than a year old, especially if the issue is marked as "resolved", I would suggest filing a fresh issue for those failures.

@joestringer
Thank you for the clarification. That makes perfect sense. My aim was to avoid a duplicate by referencing the old issue (as it was the exact same issue), but I now understand the importance of filing a fresh ticket for tracking potential regressions. I will be sure to do so going forward.

@joestringer
Copy link
Member

Given the background of test failure rate on this PR, I'd feel more confident if we ran the tests again. Note that some of the failures last time were caused by #40801 , and a fix was merged around five hours ago. Rebasing should eliminate that failure cause.

On some cloud platforms like GCE, IPs for external Load Balancers are not assigned directly to a network interface.
Instead, they are added to the node's `local` routing table to make the kernel "own" the IP. The existing node
address detection logic, which only scanned device addresses, could not see these IPs, making them unavailable for
features like NodePort.

This commit enhances the node address detection logic to discover these IPs directly from the routing table,
making this behavior the default.

The node-address controller now inspects the `local` routing table for host-scoped routes. To ensure consistency,
an address from a route is only considered if the underlying device is suitable for node addressing. This device
filtering logic is consolidated in a new `shouldUseDeviceForNodeAddress` helper, which checks if a device is "up"
and "selected", while making specific exceptions for "dummy" devices (to support use cases like `nodelocaldns`)
and other whitelisted interfaces.

Key implementation details:
- The logic avoids creating duplicate addresses if an IP from a route is already known from a device.
- The `NodePort` eligibility for these route-based IPs is determined by the existing `--nodeport-addresses`
whitelist.
- If the whitelist is not set, route-based IPs are not marked for NodePort by default.

With this change, Cilium can now correctly use GCE Load Balancer IPs for NodePort services out-of-the-box.

test: Add new unit test `TestNodeAddressFromRoute`.

The new test validates the route-based discovery logic under several conditions, including when NodePort whitelists
match or don't match the discovered IP, and ensures no duplicates are created.

Fixes: cilium#24481

Signed-off-by: Tsotne Chakhvadze <tsotne@google.com>
@tsotne95 tsotne95 force-pushed the pr/fix-24481-NodePort branch from 5ad0b13 to 6df1af1 Compare July 31, 2025 09:36
@tsotne95
Copy link
Contributor Author

/test

@pchaigno pchaigno removed the dont-merge/waiting-for-review Requires further review before merging. label Jul 31, 2025
@joamaki joamaki enabled auto-merge July 31, 2025 13:00
@joamaki joamaki added dont-merge/preview-only Only for preview or testing, don't merge it. and removed dont-merge/preview-only Only for preview or testing, don't merge it. labels Jul 31, 2025
@joamaki joamaki added this pull request to the merge queue Jul 31, 2025
Merged via the queue into cilium:main with commit a1c557e Jul 31, 2025
68 checks passed
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Jul 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. upgrade-impact This PR has potential upgrade or downgrade impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants