
Conversation

jrife
Contributor

@jrife jrife commented Feb 18, 2025

In some environments, neighbor entries for non-cluster hosts remain stale forever, breaking connectivity in cases where their IP is reassigned to a new host with a different MAC address. This isn't a problem for hosts within the cluster; Cilium sets the managed and extern_learn flags for those neighbor entries, so the kernel actively refreshes them.

In our particular case, we have a set of VMs forming our Kubernetes cluster running a NodePort service and another VM in the same L2 domain talking to that service. This set of VMs is upgraded as a unit by spinning up new VMs, draining the old ones, and reassigning the IP addresses, although not necessarily in that order. At some point during this process, the MAC address assigned to the IP for that external VM changes. The network switches in our setup enable ARP suppression which, as I understand it, suppresses GARPs if the ARP suppression cache on the switch already has an entry for a given MAC address. This, combined with Cilium's current behavior, leads to a situation where we need to apply some workarounds to get things talking again. What we hope would occur is that cluster nodes would send an ARP request for any STALE neighbor entries in their ARP table, but currently this never happens.

fib_redirect() and others call bpf_fib_lookup() and pass its result along to fib_do_redirect(). Currently, fib_do_redirect() only uses bpf_redirect_neigh() if the FIB lookup was unsuccessful, returning BPF_FIB_LKUP_RET_NO_NEIGH. Otherwise, it "manually" populates the L2 header of the packet and uses bpf_redirect() to send it out through the host interface. However, bpf_fib_lookup() returns BPF_FIB_LKUP_RET_SUCCESS regardless of the state of the neighbor entry, and bpf_redirect() doesn't trigger a neighbor event in the kernel. The net result is that stale entries can remain stale, which is problematic in setups like the one described above.
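
For illustration, here's a minimal sketch of the pre-change dispatch described above (the helper signatures are real; the function shape and DROP_WRITE_ERROR code are illustrative, not Cilium's actual fib.h code):

/* Assumes <linux/bpf.h>, <linux/if_ether.h> and bpf_helpers.h. */
static __always_inline int fib_do_redirect_sketch(struct __sk_buff *ctx,
						  const struct bpf_fib_lookup *fib,
						  int fib_ret)
{
	if (fib_ret == BPF_FIB_LKUP_RET_NO_NEIGH)
		/* Only this path engages the kernel's neighbor subsystem. */
		return bpf_redirect_neigh(fib->ifindex, NULL, 0, 0);

	/* BPF_FIB_LKUP_RET_SUCCESS: populate the L2 header "manually" and
	 * bypass the neighbor subsystem, so a STALE entry generates no
	 * neighbor event and never gets refreshed. */
	if (bpf_skb_store_bytes(ctx, 0, fib->dmac, ETH_ALEN, 0) < 0 ||
	    bpf_skb_store_bytes(ctx, ETH_ALEN, fib->smac, ETH_ALEN, 0) < 0)
		return DROP_WRITE_ERROR; /* illustrative error code */
	return bpf_redirect(fib->ifindex, 0);
}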

  • The first commit makes it so Cilium always uses bpf_redirect_neigh() when available to trigger refresh of stale entries and avoid corner cases like ours where invalid entries can remain stale forever.

  • The last two commits use the BPF_FIB_LOOKUP_SKIP_NEIGH [1] flag in preceding calls to bpf_fib_lookup() to avoid duplicate work in cases where we want to defer to bpf_redirect_neigh() for the neighbor resolution. Of these, the first introduces a probe that checks whether BPF_FIB_LOOKUP_SKIP_NEIGH is supported on the current system, as some older kernels support bpf_fib_lookup() but not the BPF_FIB_LOOKUP_SKIP_NEIGH flag; the second plumbs the probe result through to let the FIB code decide whether or not to set it. A sketch of the combined flow follows below.

[1]: bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookup
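
Roughly, the combined flow looks like this sketch (fib_params is assumed populated by the caller; supports_fib_lookup_skip_neigh stands in for the plumbed-through probe result and DROP_NO_FIB is an illustrative error code):

	__u32 flags = 0;
	int ret;

	if (supports_fib_lookup_skip_neigh)
		/* The helper's own neighbor lookup is redundant here, since
		 * bpf_redirect_neigh() resolves the neighbor itself. */
		flags |= BPF_FIB_LOOKUP_SKIP_NEIGH;

	ret = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
	if (ret != BPF_FIB_LKUP_RET_SUCCESS && ret != BPF_FIB_LKUP_RET_NO_NEIGH)
		return DROP_NO_FIB;

	/* Always hand off to the neighbor subsystem: transmitting toward a
	 * STALE entry now triggers re-resolution instead of leaving it stale. */
	struct bpf_redirect_neigh nh = {
		.nh_family = AF_INET,
		.ipv4_nh   = fib_params.ipv4_dst,
	};
	return bpf_redirect_neigh(fib_params.ifindex, &nh, sizeof(nh), 0);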

Without This Change

The neighbor entry for 192.168.4.225, a VM in the same L2 domain as my cluster nodes, remains STALE after making an HTTP request to a NodePort service hosted on that node.

jordan@crow:~$ ip neigh | grep 225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE
...
jordan@crow:~$ ip neigh | grep 225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE

With This Change

The HTTP request from 192.168.4.225 triggers a neighbor event on the cluster node, so the neighbor entry transitions from STALE to REACHABLE.

jordan@crow:~$ ip neigh | grep 192.168.4.225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE
...
jordan@crow:~$ ip neigh | grep 192.168.4.225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 DELAY
...
jordan@crow:~$ ip neigh | grep 192.168.4.225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 REACHABLE
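
As an aside, the same transition can also be watched live with iproute2's ip monitor (standard tooling, shown here purely for illustration):

jordan@crow:~$ ip monitor neigh
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 DELAY
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 REACHABLE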

Fixes: #34503

bpf: fib: Fix issue where neighbor entries remain stale forever in some cases.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 18, 2025
@jrife jrife force-pushed the jrife/always-use-bpf_redirect_neigh branch from a97bbea to b342aed Compare February 18, 2025 21:59
@jrife jrife marked this pull request as ready for review February 18, 2025 22:28
@jrife jrife requested review from a team as code owners February 18, 2025 22:28
@jrife jrife requested a review from ysksuzuki February 18, 2025 22:28
@jrife jrife force-pushed the jrife/always-use-bpf_redirect_neigh branch 2 times, most recently from e72bf8a to d176280 Compare February 18, 2025 22:33
Member

@ysksuzuki ysksuzuki left a comment


Thank you for the PR! I have two questions.

  • Have you considered making this behavior, where bpf_redirect_neigh() is always used when available, opt-in instead of enabling it by default? Or have you already evaluated that the potential performance impact would be trivial?
  • Does this issue occur only when the destination is a host outside the cluster? If so, would it be worth considering an optimization where bpf_redirect_neigh() is always used only when the destination is outside the cluster?

@borkmann
Member

borkmann commented Feb 19, 2025

In some environments, neighbor entries for non-cluster hosts remain stale forever, breaking connectivity in cases where their IP is reassigned to a new host with a different MAC address. This isn't a problem for hosts within the cluster; Cilium sets the managed and extern_learn flags for those neighbor entries, so the kernel actively refreshes them.

In our particular case, we have a set of VMs forming our Kubernetes cluster running a NodePort service and another VM in the same L2 domain talking to that service. This set of VMs is upgraded as a unit by spinning up new VMs, draining the old ones, and reassigning the IP addresses, although not necessarily in that order. At some point during this process, the MAC address assigned to the IP for that external VM changes. The network switches in our setup enable ARP suppression which, as I understand it, suppresses GARPs if the ARP suppression cache on the switch already has an entry for a given MAC address. This, combined with Cilium's current behavior, leads to a situation where we need to apply some workarounds to get things talking again. What we hope would occur is that cluster nodes would send an ARP request for any STALE neighbor entries in their ARP table, but currently this never happens.

fib_redirect() and others call bpf_fib_lookup() and pass its result along to fib_do_redirect(). Currently, fib_do_redirect() only uses bpf_redirect_neigh() if the FIB lookup was unsuccessful, returning BPF_FIB_LKUP_RET_NO_NEIGH. Otherwise, it "manually" populates the L2 header of the packet and uses bpf_redirect() to send it out through the host interface. However, bpf_fib_lookup() returns BPF_FIB_LKUP_RET_SUCCESS regardless of the state of the neighbor entry, and bpf_redirect() doesn't trigger a neighbor event in the kernel. The net result is that stale entries can remain stale which is problematic in setups like the one described above.

Always use bpf_redirect_neigh() if available to trigger refresh of stale entries and avoid corner cases like ours where invalid entries can remain stale forever. Use BPF_FIB_LOOKUP_SKIP_NEIGH [1] in preceding calls to bpf_fib_lookup() to avoid duplicate work.

[1]: bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookup

Hey Jordan, just a question for clarification/better understanding:

In the kernel's BPF fib lookup helper, we test nud_state for NUD_VALID (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/filter.c#n6180):

	if (!neigh || !(READ_ONCE(neigh->nud_state) & NUD_VALID))
		return BPF_FIB_LKUP_RET_NO_NEIGH;

The NUD_VALID is defined as (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/net/neighbour.h#n39):

#define NUD_VALID	(NUD_PERMANENT|NUD_NOARP|NUD_REACHABLE|NUD_PROBE|NUD_STALE|NUD_DELAY)

The kernel has a NEIGH_VAR_BASE_REACHABLE_TIME_MS which iiuc defines the period:

base_reachable_time (since Linux 2.2)
	Once a neighbor has been found, the entry is considered to be valid for at least a random value between base_reachable_time/2 and 3*base_reachable_time/2. An entry's validity will be extended if it receives positive feedback from higher level protocols. Defaults to 30 seconds. This file is now obsolete in favor of base_reachable_time_ms.

base_reachable_time_ms (since Linux 2.6.12)
	As for base_reachable_time, but measures time in milliseconds. Defaults to 30000 milliseconds.

Dumb q: Would it rather make sense to tweak base reachable time instead? Is that mostly for NUD_REACHABLE state or does it also relate to NUD_STALE somehow?

Are you saying with redirect_neigh() the NUD_STALE automatically triggers a new resolution to go back into NUD_REACHABLE?

Edit: found the state diagram:

[Screenshot: kernel neighbor (NUD) state machine diagram]

Ok, so in the kernel the transition from NUD_REACHABLE -> NUD_STALE does trigger re-resolution:

/* Neighbour state is suspicious;
   disable fast path.

   Called with write_locked neigh.
 */
static void neigh_suspect(struct neighbour *neigh)
{
	neigh_dbg(2, "neigh %p is suspected\n", neigh);

	WRITE_ONCE(neigh->output, neigh->ops->output);
}

@borkmann
Member

Without This Change

The neighbor entry for 192.168.4.225, a VM in the same L2 domain as my cluster nodes, remains STALE after making an HTTP request to a NodePort service hosted on that node.

jordan@crow:~$ ip neigh | grep 225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE
...
jordan@crow:~$ ip neigh | grep 225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE

With This Change

The HTTP request from 192.168.4.225 triggers a neighbor event on the cluster node, so the neighbor entry transitions from STALE to REACHABLE.

jordan@crow:~$ ip neigh | grep 192.168.4.225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE
...
jordan@crow:~$ ip neigh | grep 192.168.4.225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 DELAY
...
jordan@crow:~$ ip neigh | grep 192.168.4.225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 REACHABLE

Is it correct to assume that the above entries are the backend IPs? If yes, would this PR which is in main already have helped? #37352

@jrife
Contributor Author

jrife commented Feb 19, 2025

@ysksuzuki,

Thank you for the PR! I have two questions.

  • Have you considered making this behavior, where bpf_redirect_neigh() is always used when available, opt-in instead of enabling it by default? Or have you already evaluated that the potential performance impact would be trivial?

Thank you for taking a look! In my view this should simply be the default behavior if there are no ill effects, as it works for a broader set of circumstances. I've run some iperf3 benchmarks at 100 Gbps and did not experience any noticeable changes in throughput or CPU utilization between this and the old approach. I will try to find the measurements somewhere and provide them here. I'm happy to run some additional benchmarks if you are not convinced though :).

  • Does this issue occur only when the destination is a host outside the cluster? If so, would it be worth considering an optimization where bpf_redirect_neigh() is always used only when the destination is outside the cluster?

The issue I experienced does, yes. Outside the cluster but in the same L2 domain as the cluster nodes. I believe needing to differentiate between cluster and non-cluster IPs as we invoke fib_do_redirect() would further complicate things in fib.h. If there is no performance concern, I think we should aim for the simplest approach.

@borkmann,

Dumb q: Would it rather make sense to tweak base reachable time instead? Is that mostly for NUD_REACHABLE state or does it also relate to NUD_STALE somehow?

If my read of the docs and code is correct, I'm not sure tweaking this solves the underlying issue. Seems like it just takes longer for the entry to go stale? A long enough timeout may mask the issue I described in one environment, but as with any timeout value, what works in one environment may not work for another.

Are you saying with redirect_neigh() the NUD_STALE automatically triggers a new resolution to go back into NUD_REACHABLE?

Right, any subsequent calls to neigh_output() after neigh_suspect() happens (when an entry goes stale) trigger neigh_resolve_output() as opposed to the usual neigh_connected_output(). neigh_resolve_output() triggers neigh_event_send() before jumping into dev_queue_xmit().

static const struct neigh_ops arp_generic_ops = {
	.family =		AF_INET,
	.solicit =		arp_solicit,
	.error_report =		arp_error_report,
	.output =		neigh_resolve_output,
	.connected_output =	neigh_connected_output,
};

static const struct neigh_ops arp_hh_ops = {
	.family =		AF_INET,
	.solicit =		arp_solicit,
	.error_report =		arp_error_report,
	.output =		neigh_resolve_output,
	.connected_output =	neigh_resolve_output,
};

static inline int neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
{
	return neigh_event_send_probe(neigh, skb, true);
}

Is it correct to assume that the above entries are the backend IPs? If yes, would this PR which is in main already have helped? #37352

To clarify, 192.168.4.225 is the IP of another VM in the same L2 domain as my cluster nodes but not part of the cluster itself. The neighbor entry exists on my cluster node (crow) as 192.168.4.225 talks to a node port service hosted on crow. I'm not sure #37352 helps here, as it seems to be targeting neighbor entries for backends managed by Cilium itself.

All in all, bpf_redirect_neigh() keeps neighbor entries fresh in addition to doing the neigh table lookup. By and large, Cilium keeps most relevant neighbor entries fresh (managed extern_learn for cluster nodes, etc.), but there are some corner cases lurking like this one. I was actually surprised to learn that Cilium doesn't already just default to bpf_redirect_neigh() wherever possible, as it seems like the behavior more closely matches what you'd get when using the kernel's stack. This could be my own ignorance talking though.

Are there performance reasons for the current preference for bpf_redirect() in most cases? I was concerned at first there might be a performance hit from switching to bpf_redirect_neigh() in all cases due to the duplicated neighbor lookup (at least without BPF_FIB_LOOKUP_SKIP_NEIGH) in bpf_fib_lookup() and some of the extra work it has to do compared to bpf_redirect(). I was unable to detect any significant differences with a 100 Gbps iperf3 test, which I'm happy to share here, although that doesn't mean they don't exist :). If indeed we are concerned about every ounce of performance and bpf_redirect_neigh() is noticeably slower or worse in some dimension, perhaps we can be a bit more intelligent about when we use bpf_redirect_neigh(), as @ysksuzuki hinted at above; however, I wanted to explore the simple approach first.

@ysksuzuki
Member

To clarify, 192.168.4.225 is the IP of another VM in the same L2 domain as my cluster nodes but not part of the cluster itself. The neighbor entry exists on my cluster node (crow) as 192.168.4.225 talks to a node port service hosted on crow

@jrife Just to confirm, the issue occurs when the datapath handles the reply traffic from the backends at nodeport_rev_dnat_ingress_ipv[4,6], and it hits a stale ARP cache entry for the client 192.168.4.225?

We can rely on the managed and extern_learn neighbor entries for the backends in the forward path. (With #37352, the Cilium agent manages these entries even if the backends are external hosts.) However, we can't rely on them in the reply path, as the Cilium agent doesn't manage the client's neighbor entries.

@jrife
Contributor Author

jrife commented Feb 20, 2025

@jrife Just to confirm, the issue occurs when the datapath handles the reply traffic from the backends at nodeport_rev_dnat_ingress_ipv[4,6], and it hits a stale ARP cache entry for the client 192.168.4.225?

This is correct.

@jrife
Contributor Author

jrife commented Mar 19, 2025

Bumping this thread again, since I'm not sure what the verdict was here, @ysksuzuki.

This comment was marked as resolved.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Apr 19, 2025
@jrife
Contributor Author

jrife commented Apr 21, 2025

@borkmann @ysksuzuki Looks like this one went stale. I'm wondering what the resolution is here. Is there anything in particular you are concerned about or data you would like to see?

@github-actions github-actions bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Apr 22, 2025
@julianwiedmann julianwiedmann self-requested a review April 22, 2025 13:12
@julianwiedmann julianwiedmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels May 6, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels May 6, 2025
@julianwiedmann
Member

My immediate reaction to this was "ugh, this feels like a workaround for something that should be improved in the kernel. And it won't help bpf_xdp when encountering the same problem".

At the same time, once #38308 is merged I feel it's a good moment to take a fresh look at all our FIB-driven redirect handling. And see whether we can simplify the logic more, or extend it to also cover scenarios like this.

@jrife
Contributor Author

jrife commented May 6, 2025

My immediate reaction to this was "ugh, this feels like a workaround for something that should be improved in the kernel. And it won't help bpf_xdp when encountering the same problem".

From my point of view, it's less a kernel issue and more that the behavior of bpf_redirect skips a step that the kernel would normally perform on the way out, namely engaging with the neighbor subsystem and triggering side effects like neighbor resolution on stale entries. bpf_redirect_neigh does do this stuff, so all things being equal (including performance) I'm not sure why you wouldn't just prefer redirect_neigh here; it seems like you'd be less likely to hit weird edge cases like this. Any insight into why bpf_redirect is currently preferred here?

@borkmann
Member

borkmann commented May 7, 2025

My immediate reaction to this was "ugh, this feels like a workaround for something that should be improved in the kernel. And it won't help bpf_xdp when encountering the same problem".

From my point of view, it's less a kernel issue and more that the behavior of bpf_redirect skips a step that the kernel would normally perform on the way out, namely engaging with the neighbor subsystem and triggering side effects like neighbor resolution on stale entries. bpf_redirect_neigh does do this stuff, so all things being equal (including performance) I'm not sure why you wouldn't just prefer redirect_neigh here; it seems like you'd be less likely to hit weird edge cases like this. Any insight into why bpf_redirect is currently preferred here?

Yes, we should just stick to redirect_neigh and remove the corner cases. For XDP it's unfortunately not possible to resolve anything ad-hoc, so there we need to rely on managed neigh to keep entries fresh and have this corner case wrt staleness for non-managed neighbors.

@julianwiedmann
Member

At the same time, once #38308 is merged I feel it's a good moment to take a fresh look at all our FIB-driven redirect handling. And see whether we can simplify the logic more, or extend it to also cover scenarios like this.

Started #39490, would be cool to rebase once this lands and see how easy the code looks overall.

@jrife
Contributor Author

jrife commented May 13, 2025

Started #39490, would be cool to rebase once this lands and see how easy the code looks overall.

Sure I can wait for this to merge. This PR probably needs a rebase in general, since it's a bit old.

@jrife
Contributor Author

jrife commented Aug 15, 2025

ci-gateway-api failure looks like #40243. Retriggering.

@jrife
Contributor Author

jrife commented Aug 15, 2025

/ci-ginkgo

@jrife
Contributor Author

jrife commented Aug 15, 2025

/ci-gateway-api

@jrife
Contributor Author

jrife commented Aug 15, 2025

ci-ginkgo is failing fairly consistently with

• Failure [173.568 seconds]
K8sDatapathServicesTest
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
  Checks N/S loadbalancing
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
    With ClusterIP external access
    /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
      ClusterIP can be accessed when external access is enabled [It]
      /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515

      cannot curl to service IP from host: time-> DNS: '0.000022()', Connect: '0.000000',Transfer '0.000000', total '0.000000'command terminated with exit code 7
      
      Expected command: kubectl exec -n kube-system log-gatherer-52wsk -- curl -k --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 20 http://10.96.114.208:80/ -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'" 
      To succeed, but it failed:
      Exitcode: 7 
      Err: exit status 7
      Stdout:
       	 time-> DNS: '0.000022()', Connect: '0.000000',Transfer '0.000000', total '0.000000'
      Stderr:
       	 command terminated with exit code 7

At a glance, this seems unrelated to this PR, as this is the first time I'm seeing it and it only started happening after I rebased. Inspecting the most recent runs, I see several with the same failure.

I'll do a bit more investigation and perhaps file a CI bug.

Edit: Another PR with the same failure: #41190

@joestringer
Member

FWIW looking at the above links, if you browse to "Echo Workflow Dispatch Inputs" and expand the step inside, it's clear that these results are each from different PRs. The last link skipped that step, which implies the workflow was triggered directly (likely due to schedule) rather than via workflow_dispatch, so that likely means that it was also observed directly on a branch.

@jrife
Contributor Author

jrife commented Aug 15, 2025

@joestringer Thanks. I just discovered the "Echo Workflow Dispatch Inputs" step after desperately searching for a way to correlate the run with its origin. I updated my description above to point to another example I found. Since this seems like a CI regression and not localized to this PR, I'll open an issue.

@jrife
Contributor Author

jrife commented Aug 18, 2025

/test

Contributor

@rgo3 rgo3 left a comment


Only left one more nit, otherwise this LGTM for loader changes. Thanks for your patience Jordan!

@pchaigno pchaigno enabled auto-merge August 19, 2025 06:31
jrife added 3 commits August 19, 2025 08:35
In some environments, neighbor entries for non-cluster hosts remain
stale forever, breaking connectivity in cases where their IP is
reassigned to a new host with a different MAC address. This isn't a
problem for hosts within the cluster; Cilium sets the `managed` and
`extern_learn` flags for those neighbor entries, so the kernel actively
refreshes them.

In our particular case, we have a set of VMs forming our Kubernetes
cluster running a NodePort service and another VM in the same L2
domain talking to that service. This set of VMs is upgraded as a unit
by spinning up new VMs, draining the old ones, and reassigning the IP
addresses, although not necessarily in that order. At some point during
this process, the MAC address assigned to the IP for that external VM
changes. The network switches in our setup enable ARP suppression
which, as I understand it, suppresses GARPs if the ARP suppression cache
on the switch already has an entry for a given MAC address. This,
combined with Cilium's current behavior, leads to a situation where we
need to apply some workarounds to get things talking again. What we
hope would occur is that cluster nodes would send an ARP request for
any STALE neighbor entries in their ARP table, but currently this never
happens.

fib_redirect() and others call bpf_fib_lookup() and pass its result
along to fib_do_redirect(). Currently, fib_do_redirect() only uses
bpf_redirect_neigh() if the FIB lookup was unsuccessful, returning
BPF_FIB_LKUP_RET_NO_NEIGH. Otherwise, it "manually" populates the L2
header of the packet and uses bpf_redirect() to send it out through the
host interface. However, bpf_fib_lookup() returns
BPF_FIB_LKUP_RET_SUCCESS regardless of the state of the neighbor entry,
and bpf_redirect() doesn't trigger a neighbor event in the kernel. The
net result is that stale entries can remain stale which is problematic
in setups like the one described above.

Always use bpf_redirect_neigh() if available to trigger refresh of stale
entries and avoid corner cases like ours where invalid entries can
remain stale forever.

Without This Change
-------------------
The neighbor entry for 192.168.4.225, a VM in the same L2 domain as my
cluster nodes, remains STALE after making an HTTP request to a NodePort
service hosted on that node.

jordan@crow:~$ ip neigh | grep 225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE
...
jordan@crow:~$ ip neigh | grep 225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE

With This Change
----------------
The HTTP request from 192.168.4.225 triggers a neighbor event on the
cluster node, so the neighbor entry transitions from STALE to
REACHABLE.

jordan@crow:~$ ip neigh | grep 192.168.4.225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE
...
jordan@crow:~$ ip neigh | grep 192.168.4.225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 DELAY
...
jordan@crow:~$ ip neigh | grep 192.168.4.225
192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 REACHABLE

Fixes: cilium#34503
Signed-off-by: Jordan Rife <jrife@google.com>
Prepare for the next patch which passes BPF_FIB_LOOKUP_SKIP_NEIGH to
bpf_fib_lookup in cases where we want to defer to bpf_redirect_neigh for
neighbor resolution by adding a probe to see if the current system
supports it.

Fixes: cilium#34503
Signed-off-by: Jordan Rife <jrife@google.com>
Use BPF_FIB_LOOKUP_SKIP_NEIGH in cases where the system supports it and
we want to defer to bpf_redirect_neigh for neighbor resolution.

Fixes: cilium#34503
Signed-off-by: Jordan Rife <jrife@google.com>
@jrife jrife force-pushed the jrife/always-use-bpf_redirect_neigh branch from d101914 to a7aeb35 Compare August 19, 2025 15:43
@pchaigno
Member

/test

@jrife
Contributor Author

jrife commented Aug 19, 2025

The ci-multi-pool failure seems to be an existing flake looking at a series of recent runs. Created #41283 with details.

@jrife
Contributor Author

jrife commented Aug 19, 2025

/ci-multi-pool

@jrife
Contributor Author

jrife commented Aug 19, 2025

Retriggering ci-eks. Looks like the node groups couldn't be created (#40836).

@jrife
Contributor Author

jrife commented Aug 19, 2025

/ci-eks

@pchaigno pchaigno added this pull request to the merge queue Aug 19, 2025
Merged via the queue into cilium:main with commit 430dd4f Aug 19, 2025
68 checks passed
Comment on lines +28 to +29
NODE_CONFIG(bool, supports_fib_lookup_skip_neigh,
"Whether or not BPF_FIB_LOOKUP_SKIP_NEIGH is supported.")
Member


For my own education - how would we accurately test this in bpf/complexity-tests?

In the past we would simply add the relevant HAVE_FEATURE_FOO macro to the configs for relevant kernel versions. But for load-time configs that's no longer possible ...?

Member


See #40367.

Member


👍 ok so on a casual look the new verifier-loader still needs to learn (for instance here) how to set up SupportsFibLookupSkipNeigh according to the tested kernel version.

And that matches the fact that ci-verifier currently runs all its permutations with SupportsFibLookupSkipNeigh:false, even on bpf-next.

Member


Yes, when we add a new load-time config option or convert one from compile time, we should add it to the ...Permutations function(s). We still have to figure out which combinations we want to run and which we don't care about as much (we are likely most interested in combinations that result in higher complexity, not all possible permutations).

Since that logic is in Go, we should indeed be able to use probes instead of hard-coding settings to certain kernel versions.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 2, 2025