bpf: fib: always use bpf_redirect_neigh() when available #37725
Conversation
Force-pushed from a97bbea to b342aed.
Force-pushed from e72bf8a to d176280.
Thank you for the PR! I have two questions.
- Have you considered making this behavior, where bpf_redirect_neigh() is always used when available, opt-in instead of enabling it by default? Or have you already evaluated that the potential performance impact would be trivial?
- Does this issue occur only when the destination is a host outside the cluster? If so, would it be worth considering an optimization where bpf_redirect_neigh() is always used only when the destination is outside the cluster?
Hey Jordan, just a question for clarification/better understanding: In the kernel's BPF fib lookup helper, we test nud_state for NUD_VALID (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/filter.c#n6180):
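```c
/* Paraphrased from bpf_ipv4_fib_lookup() in net/core/filter.c; the exact
 * surrounding code varies by kernel version. */
if (!neigh || !(READ_ONCE(neigh->nud_state) & NUD_VALID))
	return BPF_FIB_LKUP_RET_NO_NEIGH;
```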
NUD_VALID is defined as (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/net/neighbour.h#n39):
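```c
/* include/net/neighbour.h; note that NUD_STALE counts as "valid". */
#define NUD_VALID (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE | \
		   NUD_PROBE | NUD_STALE | NUD_DELAY)
```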
The kernel has a ...
Dumb q: Would it rather make sense to tweak the base reachable time instead? Is that mostly for the NUD_REACHABLE state, or does it also relate to NUD_STALE somehow? Are you saying that with redirect_neigh() the NUD_STALE state automatically triggers a new resolution to go back into NUD_REACHABLE?

Edit: found the state diagram. Ok, so in the kernel the transition from NUD_REACHABLE -> NUD_STALE does trigger re-resolution:
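```c
/* Neighbor FSM transitions relevant here, paraphrased from
 * net/core/neighbour.c (not an exhaustive diagram):
 *
 *   NUD_REACHABLE --(reachable_time expires)------------> NUD_STALE
 *   NUD_STALE     --(tx via neigh_event_send())---------> NUD_DELAY
 *   NUD_DELAY     --(delay_first_probe_time elapses,
 *                    no confirmation seen)--------------> NUD_PROBE
 *   NUD_PROBE     --(ARP/NS probe answered)-------------> NUD_REACHABLE
 *
 * The STALE -> DELAY kick happens on the neigh_output() /
 * neigh_event_send() path; plain bpf_redirect() bypasses it entirely.
 */
```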
Is it correct to assume that the above entries are the backend IPs? If yes, would this PR, which is already in main, have helped? #37352
Thank you for taking a look! In my view this should simply be the default behavior if there are no ill effects, as it works for a broader set of circumstances. I've run some iperf3 benchmarks at 100 Gbps and did not experience any noticeable changes in throughput or CPU utilization between this and the old approach. I will try to find the measurements somewhere and provide them here. I'm happy to run some additional benchmarks if you are not convinced though :).
The issue I experienced does, yes: outside the cluster but in the same L2 domain as the cluster nodes. I believe it may further complicate things in ...
If my read of the docs and code is correct, I'm not sure tweaking this solves the underlying issue. Seems like it just takes longer for the entry to go stale? A long enough timeout may mask the issue I described in one environment, but as with any timeout value, what works in one environment may not work for another.
Right, any subsequent calls to `neigh_event_send()`:

```c
static inline int neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
{
	return neigh_event_send_probe(neigh, skb, true);
}
```
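For context (paraphrasing `include/net/neighbour.h`, so details may differ by kernel version), the wrapper it calls only kicks the state machine when the entry isn't already connected or being probed:

```c
static inline int neigh_event_send_probe(struct neighbour *neigh,
					 struct sk_buff *skb,
					 const bool immediate_ok)
{
	unsigned long now = jiffies;

	/* Refresh the "last used" timestamp on every send. */
	if (READ_ONCE(neigh->used) != now)
		WRITE_ONCE(neigh->used, now);
	/* NUD_STALE is in none of these states, so a send over a STALE
	 * entry falls through to __neigh_event_send(), which starts the
	 * STALE -> DELAY transition. */
	if (!(READ_ONCE(neigh->nud_state) & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE)))
		return __neigh_event_send(neigh, skb, immediate_ok);
	return 0;
}
```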
To clarify, ... All in all, ... Are there performance reasons for the current preference for `bpf_redirect()`?
@jrife Just to confirm, the issue occurs when the datapath handles the reply traffic from the backends at ... We can rely on the ...
This is correct.
Bumping this thread again, since I'm not sure what the verdict was here, @ysksuzuki.
@borkmann @ysksuzuki Looks like this one went stale. I'm wondering what the resolution is here. Is there anything in particular you are concerned about or data you would like to see?
My immediate reaction to this was "ugh, this feels like a workaround for something that should be improved in the kernel, and it won't help ...". At the same time, once #38308 is merged, I feel it's a good moment to take a fresh look at all our FIB-driven redirect handling, and see whether we can simplify the logic more, or extend it to also cover scenarios like this.
From my point of view, it's less a kernel issue and more that the behavior of ...
Yes, we should just stick to ...
Started here; would be cool to rebase once this lands and see how easy the code looks overall.
Sure, I can wait for this to merge. This PR probably needs a rebase in general, since it's a bit old.
/ci-ginkgo
/ci-gateway-api
At a glance, this seems unrelated to this PR, as this is the first time I'm seeing it, and it started happening after I rebased. Inspecting the most recent runs, I see several with the same failure:
I'll do a bit more investigation and perhaps file a CI bug.
FWIW, looking at the above links, if you browse to "Echo Workflow Dispatch Inputs" and expand the step inside, it's clear that these results are each from different PRs. The last link skipped that step, which implies the workflow was triggered directly (likely due to schedule) rather than via ...
@joestringer Thanks. I just discovered the "Echo Workflow Dispatch Inputs" step after desperately searching for a way to correlate the run with its origin. I updated my description above to point to another example I found. Since this seems like a CI regression and not localized to this PR, I'll open an issue.
Force-pushed from 6ab39c0 to d101914.
/test
Only left one more nit; otherwise this LGTM for the loader changes. Thanks for your patience, Jordan!
In some environments, neighbor entries for non-cluster hosts remain stale forever, breaking connectivity in cases where their IP is reassigned to a new host with a different MAC address. This isn't a problem for hosts within the cluster; Cilium sets the `managed` and `extern_learn` flags for those neighbor entries, so the kernel actively refreshes them.

In our particular case, we have a set of VMs forming our Kubernetes cluster running a NodePort service and another VM in the same L2 domain talking to that service. This set of VMs is upgraded as a unit by spinning up new VMs, draining the old ones, and reassigning the IP addresses, although not necessarily in that order. At some point during this process, the MAC address assigned to the IP for that external VM changes. The network switches in our setup enable ARP suppression which, as I understand it, suppresses GARPs if the ARP suppression cache on the switch already has an entry for a given MAC address. This, combined with Cilium's current behavior, leads to a situation where we need to apply some workarounds to get things talking again. What we hope would occur is that cluster nodes would send an ARP request for any STALE neighbor entries in their ARP table, but currently this never happens.

fib_redirect() and others call bpf_fib_lookup() and pass its result along to fib_do_redirect(). Currently, fib_do_redirect() only uses bpf_redirect_neigh() if the FIB lookup was unsuccessful, returning BPF_FIB_LKUP_RET_NO_NEIGH. Otherwise, it "manually" populates the L2 header of the packet and uses bpf_redirect() to send it out through the host interface. However, bpf_fib_lookup() returns BPF_FIB_LKUP_RET_SUCCESS regardless of the state of the neighbor entry, and bpf_redirect() doesn't trigger a neighbor event in the kernel. The net result is that stale entries can remain stale, which is problematic in setups like the one described above.

Always use bpf_redirect_neigh() if available to trigger a refresh of stale entries and avoid corner cases like ours where invalid entries can remain stale forever.

Without This Change
-------------------

The neighbor entry for 192.168.4.225, a VM in the same L2 domain as my cluster nodes, remains STALE after making an HTTP request to a NodePort service hosted on that node.

    jordan@crow:~$ ip neigh | grep 225
    192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE
    ...
    jordan@crow:~$ ip neigh | grep 225
    192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE

With This Change
----------------

The HTTP request from 192.168.4.225 triggers a neighbor event on the cluster node, so the neighbor entry transitions from STALE to REACHABLE.

    jordan@crow:~$ ip neigh | grep 192.168.4.225
    192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 STALE
    ...
    jordan@crow:~$ ip neigh | grep 192.168.4.225
    192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 DELAY
    ...
    jordan@crow:~$ ip neigh | grep 192.168.4.225
    192.168.4.225 dev enp5s0 lladdr bc:24:11:93:7f:d4 REACHABLE

Fixes: cilium#34503
Signed-off-by: Jordan Rife <jrife@google.com>
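In code terms, the change amounts to something like the sketch below. This is a minimal illustration using plain BPF helpers rather than Cilium's actual `lib/fib.h` code, and `HAVE_REDIRECT_NEIGH` stands in for whatever mechanism the loader really uses to detect `bpf_redirect_neigh()` support:

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <stddef.h>
#include <bpf/bpf_helpers.h>

static __always_inline int redirect_sketch(struct __sk_buff *skb,
					   const struct bpf_fib_lookup *fib,
					   int fib_ret)
{
#ifdef HAVE_REDIRECT_NEIGH
	/* After this change: unconditionally defer L2 resolution to the
	 * kernel. Previously this path was taken only for
	 * BPF_FIB_LKUP_RET_NO_NEIGH, so a SUCCESS result with a STALE
	 * neighbor entry went down the manual path below forever. */
	return bpf_redirect_neigh(fib->ifindex, NULL, 0, 0);
#else
	/* Kernels without bpf_redirect_neigh(): rewrite the Ethernet
	 * header from the FIB result ourselves. Note that bpf_redirect()
	 * never updates the neighbor entry's state. */
	if (fib_ret != BPF_FIB_LKUP_RET_SUCCESS)
		return TC_ACT_SHOT;
	if (bpf_skb_store_bytes(skb, offsetof(struct ethhdr, h_dest),
				fib->dmac, ETH_ALEN, 0) < 0 ||
	    bpf_skb_store_bytes(skb, offsetof(struct ethhdr, h_source),
				fib->smac, ETH_ALEN, 0) < 0)
		return TC_ACT_SHOT;
	return bpf_redirect(fib->ifindex, 0);
#endif
}
```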
Prepare for the next patch, which passes BPF_FIB_LOOKUP_SKIP_NEIGH to bpf_fib_lookup() in cases where we want to defer to bpf_redirect_neigh() for neighbor resolution, by adding a probe to see if the current system supports it.

Fixes: cilium#34503
Signed-off-by: Jordan Rife <jrife@google.com>
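The kernel-side behavior a probe like this can key off of: `bpf_fib_lookup()` rejects flag bits it doesn't know about with `-EINVAL`, so a test run with `BPF_FIB_LOOKUP_SKIP_NEIGH` set cleanly distinguishes old kernels from new ones. Roughly, from `net/core/filter.c` as of the commit that added the flag:

```c
	if (plen < sizeof(*params))
		return -EINVAL;

	/* Kernels predating BPF_FIB_LOOKUP_SKIP_NEIGH hit this check and
	 * return -EINVAL when the flag is set. */
	if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT |
		      BPF_FIB_LOOKUP_SKIP_NEIGH))
		return -EINVAL;
```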
Use BPF_FIB_LOOKUP_SKIP_NEIGH in cases where the system supports it and we want to defer to bpf_redirect_neigh for neighbor resolution.

Fixes: cilium#34503
Signed-off-by: Jordan Rife <jrife@google.com>
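A hypothetical sketch of what this looks like at the call site; `supports_fib_lookup_skip_neigh` mirrors the load-time `NODE_CONFIG` value discussed further down, and the surrounding names are illustrative rather than Cilium's actual code:

```c
#include <linux/bpf.h>
#include <stdbool.h>
#include <bpf/bpf_helpers.h>

/* Load-time config constant, filled in by the loader before attach. */
volatile const bool supports_fib_lookup_skip_neigh;

static __always_inline long fib_lookup_sketch(struct __sk_buff *skb,
					      struct bpf_fib_lookup *params)
{
	__u32 flags = 0;

	/* No point resolving the neighbor during the FIB lookup when
	 * bpf_redirect_neigh() will do that work right afterwards. */
	if (supports_fib_lookup_skip_neigh)
		flags |= BPF_FIB_LOOKUP_SKIP_NEIGH;

	return bpf_fib_lookup(skb, params, sizeof(*params), flags);
}
```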
Force-pushed from d101914 to a7aeb35.
/test
/ci-multi-pool
/ci-eks
```c
NODE_CONFIG(bool, supports_fib_lookup_skip_neigh,
	    "Whether or not BPF_FIB_LOOKUP_SKIP_NEIGH is supported.")
```
For my own education: how would we accurately test this in `bpf/complexity-tests`? In the past we would simply add the relevant `HAVE_FEATURE_FOO` macro to the configs for relevant kernel versions. But for load-time configs that's no longer possible...?
See #40367.
👍 ok, so on a casual look the new verifier-loader still needs to learn (for instance here) how to set up `SupportsFibLookupSkipNeigh` according to the tested kernel version. And that matches the fact that `ci-verifier` currently runs all its permutations with `SupportsFibLookupSkipNeigh: false`, even on `bpf-next`.
Yes, when we add a new load-time config option or convert one from compile time, we should add it to the `...Permutations` function(s). We still have to figure out which combinations we want to run and which we don't care about as much (we are likely most interested in the combinations that result in higher complexity, not all possible permutations).

Since that logic is in Go, we should indeed be able to use probes instead of hard-coding settings to certain kernel versions.
In some environments, neighbor entries for non-cluster hosts remain stale forever, breaking connectivity in cases where their IP is reassigned to a new host with a different MAC address. This isn't a problem for hosts within the cluster; Cilium sets the `managed` and `extern_learn` flags for those neighbor entries, so the kernel actively refreshes them.

In our particular case, we have a set of VMs forming our Kubernetes cluster running a NodePort service and another VM in the same L2 domain talking to that service. This set of VMs is upgraded as a unit by spinning up new VMs, draining the old ones, and reassigning the IP addresses, although not necessarily in that order. At some point during this process, the MAC address assigned to the IP for that external VM changes. The network switches in our setup enable ARP suppression which, as I understand it, suppresses GARPs if the ARP suppression cache on the switch already has an entry for a given MAC address. This, combined with Cilium's current behavior, leads to a situation where we need to apply some workarounds to get things talking again. What we hope would occur is that cluster nodes would send an ARP request for any STALE neighbor entries in their ARP table, but currently this never happens.

`fib_redirect()` and others call `bpf_fib_lookup()` and pass its result along to `fib_do_redirect()`. Currently, `fib_do_redirect()` only uses `bpf_redirect_neigh()` if the FIB lookup was unsuccessful, returning `BPF_FIB_LKUP_RET_NO_NEIGH`. Otherwise, it "manually" populates the L2 header of the packet and uses `bpf_redirect()` to send it out through the host interface. However, `bpf_fib_lookup()` returns `BPF_FIB_LKUP_RET_SUCCESS` regardless of the state of the neighbor entry, and `bpf_redirect()` doesn't trigger a neighbor event in the kernel. The net result is that stale entries can remain stale, which is problematic in setups like the one described above.

The first commit makes it so Cilium always uses `bpf_redirect_neigh()` when available to trigger a refresh of stale entries and avoid corner cases like ours where invalid entries can remain stale forever.

The last two commits use the `BPF_FIB_LOOKUP_SKIP_NEIGH` [1] flag in preceding calls to `bpf_fib_lookup()` to avoid duplicate work in cases where we want to defer to `bpf_redirect_neigh()` for the neighbor resolution. Of these, the first introduces a probe that ensures that `BPF_FIB_LOOKUP_SKIP_NEIGH` is supported on the current system, as some older kernels may support `bpf_fib_lookup()` but not the `BPF_FIB_LOOKUP_SKIP_NEIGH` flag. The second plumbs through the probe result to let the FIB code decide whether or not to set it.

[1]: bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookup
Without This Change
The neighbor entry for 192.168.4.225, a VM in the same L2 domain as my cluster nodes, remains STALE after making an HTTP request to a NodePort service hosted on that node.
With This Change
The HTTP request from 192.168.4.225 triggers a neighbor event on the cluster node, so the neighbor entry transitions from STALE to REACHABLE.
Fixes: #34503