cilium, ci: Add netkit with per-endpoint-routes #35542
/ci-e2e-upgrade |
@jrife Fyi, looks like the general bpf/bpf-next tree update leads to a kubeapiserver issue:
Given it's IPv6, my suspicion for now is this regression: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/netfilter?id=306ed1728e8438caed30332e1ab46b28c25fe3d8 Building the latest image now: cilium/little-vm-helper-images#728 and will update here. |
179b210 to 86ac379
/ci-e2e-upgrade |
86ac379 to dd122d2
Hm, seems still something off:
edit: Added net branch now as well which is what we need for the last two: cilium/little-vm-helper-images#729 |
I'm guessing the previous image(s) were building from |
Yes, I've just added support for the net branch via cilium/little-vm-helper-images#729. What is odd, however, is that the non-netkit tests seem to fail with the bpf kernel update. Maybe a kernel regression somewhere. |
I remember @aanm mentioning that newer bpf-next images were breaking CI recently. Perhaps a bisect is required here if bpf-next/net also has issues. |
What are the repro steps to hit the issue mentioned above? Is it just "deploy Cilium on the latest bpf-next/master kernel" or is there a particular connectivity test that fails? |
From the sysdump of the e2e test (https://github.com/cilium/cilium/actions/runs/11519637468) for the bpf-next tests (i.e. those which are not using netkit) it looks like the agent comes up but there is a connection refused error. Need to double check again. I think it should be reproducible in a kind environment by running the cilium-cli test suite as the e2e test does. |
(fwiw, this one now has |
Looking at cilium-2pqr6 from the failed e2e sysdump, test 7:
Looks like Cilium is up but the echo Pods are in CrashLoopBackOff, the cat--var-run-cilium-cilium-cni.log.md however does not show any level=warn or level=error. status verbose shows the following.. maybe red herring, not sure:
fwiw, ip n from sysdump seems ok on first glance:
The agent does not have any warn or error level logs aside from unrelated:
dmesg looks clean as well.
kubelet event log:
edit: The degraded comes from:
|
I wasn't able to reproduce this in a kind cluster, but did reproduce the issue locally by just downloading the
To offer another data point, I tried this test with the |
If you have some cycles that would be awesome, thanks! I might otherwise not get to it before KubeCon. |
Cc @jschwinger233 found that it looks like an issue with too many open files:
Also, Gray mentioned: lvh-image already set 512 max_user_instances by writing /etc/sysctl.conf, but somehow kernel doesn't load the file until I manually run sysctl -p. https://github.com/search?q=repo%3Acilium%2Flittle-vm-helper-images%20sysctl&type=code
Maybe this is the root cause: new kernel somehow no longer loads sysctl from /etc/sysctl.conf on boot. |
I can confirm that running |
(related PR: cilium/little-vm-helper-images#731 ) |
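One way to check the hypothesis above (settings in /etc/sysctl.conf not being applied at boot) is to compare the file against the live values under /proc/sys. The following is a minimal sketch, not something from the lvh images; the helper names `sysctlPath`, `parseSysctlConf`, and `notApplied` are hypothetical, and it only handles simple single-valued keys.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strings"
)

// sysctlPath maps a dotted sysctl key to its file under /proc/sys.
// Assumption: the key has no dots inside a path component (e.g. VLAN
// interface names), which would need special handling.
func sysctlPath(key string) string {
	return path.Join("/proc/sys", strings.ReplaceAll(key, ".", "/"))
}

// parseSysctlConf extracts "key = value" pairs from sysctl.conf-style text,
// skipping blank lines and comments starting with '#' or ';'.
func parseSysctlConf(text string) map[string]string {
	conf := make(map[string]string)
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || line[0] == '#' || line[0] == ';' {
			continue
		}
		if k, v, ok := strings.Cut(line, "="); ok {
			conf[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return conf
}

// notApplied returns the keys whose live value differs from the configured
// one. The reader is injected so the comparison can run without /proc.
func notApplied(conf map[string]string, read func(key string) (string, error)) []string {
	var keys []string
	for k, want := range conf {
		got, err := read(k)
		if err != nil || strings.TrimSpace(got) != want {
			keys = append(keys, k)
		}
	}
	return keys
}

func main() {
	data, err := os.ReadFile("/etc/sysctl.conf")
	if err != nil {
		fmt.Println("cannot read /etc/sysctl.conf:", err)
		return
	}
	live := func(key string) (string, error) {
		b, err := os.ReadFile(sysctlPath(key))
		return string(b), err
	}
	for _, k := range notApplied(parseSysctlConf(string(data)), live) {
		fmt.Println("not applied:", k)
	}
}
```

If this prints any keys right after boot and nothing after a manual `sysctl -p`, that would be consistent with the file not being loaded on boot.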
dd122d2 to 1b2a503
/ci-e2e-upgrade |
Looks like with the new fix that test suite is still failing: https://github.com/cilium/cilium/actions/runs/11570233767/job/32210457502 |
I may have some time today. I'll run again locally and see if I can reproduce the failures from the latest run. |
OK, I was able to reproduce this new problem. I hit this while running
Looking at the pods in
This all seems like the original problem in #34042. Is it possible that either the kernel build doesn't have the Netkit scrub patch or the Cilium version used doesn't have #35306? I just used the same build and tags referenced in the logs from the failed test job. |
root@kind-bpf-net:/host/netlink# cat cmd/netkit-get/netkit-get.go
package main

import (
	"fmt"
	"os"

	"github.com/vishvananda/netlink"
)

func main() {
	result, err := netlink.LinkByName(os.Args[1])
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", result)
}

I wrote this small utility to check the netkit configuration for one of those Pods' LXC interfaces and it looks correct. |
Hmm strange, after switching the datapath mode to
|
Was there anything log-wise in the two echo Pods which could give further hints?
|
Nothing interesting there, although looking more at the Cilium agent logs I think this is the issue
This comes out of Edit: Hmm could this possibly be because
I don't have time to check today, but could try doing a local kernel build with this set to see if it resolves the issue later. |
It is not set in the build:
But it also does not seem to be listed as part of the requirements:
My guess is that if it were required, current CI would break as well, but it is worth a try.
Edit: Ah interesting.. we do use it fwiw: cilium/pkg/datapath/iptables/iptables.go Line 524 in 39165a1
|
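A quick way to answer the "is it set in the build" question for a given kernel is to scan its config file for the options of interest. The sketch below is a hedged illustration only: the helper names `parseKernelConfig` and `missingOptions` are hypothetical, and the sample config fragment is made up for the example rather than taken from the CI kernel.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseKernelConfig reads kernel .config text and returns the options that
// are set, mapped to their value ("y", "m", or a literal value).
func parseKernelConfig(text string) map[string]string {
	opts := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blanks and comments, including "# CONFIG_FOO is not set".
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if name, value, ok := strings.Cut(line, "="); ok {
			opts[name] = value
		}
	}
	return opts
}

// missingOptions returns the required options that are neither built in (y)
// nor built as a module (m).
func missingOptions(opts map[string]string, required []string) []string {
	var missing []string
	for _, name := range required {
		if v := opts[name]; v != "y" && v != "m" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Hypothetical config fragment for illustration only.
	sample := "CONFIG_BPF=y\n# CONFIG_NETFILTER_XT_TARGET_MARK is not set\n"
	required := []string{"CONFIG_BPF", "CONFIG_NETFILTER_XT_TARGET_MARK"}
	fmt.Println(missingOptions(parseKernelConfig(sample), required))
}
```

On a real system the input would typically come from the kernel build's .config or, where available, /boot/config-$(uname -r); which file applies to the CI images is not stated in this thread.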
Jordan reported that CONFIG_NETFILTER_XT_TARGET_MARK is missing in CI kernels [0]. We should also add it to the documentation to make it clear that it is needed. Reported-by: Jordan Rife <jrife@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: #35542 (comment) [0]
Needed for Cilium's L7 proxy [0]. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: cilium/cilium#35542 (comment) [0]
It looks like we might be hitting the same bug as #35436 (comment). And bpf-next/net does not yet have the fix: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/log/net/netfilter?h=net (fix: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=306ed1728e8438caed30332e1ab46b28c25fe3d8). I'll try to push out the bpf-next PR by end of week. |
I can confirm this fixes the issue with readiness probes after trying on my machine. |
Jordan fixed Cilium with netkit and per-endpoint-routes in #35306. Given we have a more recent bpf image now, let's also add it to CI to regularly test for regressions. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
1b2a503 to d6c48c0
/ci-e2e-upgrade |
@jrife Nice, tests are green now! 🎉 |
[ upstream commit 123f374 ] Jordan reported that CONFIG_NETFILTER_XT_TARGET_MARK is missing in CI kernels [0]. We should also add it to the documentation to make it clear that it is needed. Reported-by: Jordan Rife <jrife@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: #35542 (comment) [0] Signed-off-by: Jussi Maki <jussi@isovalent.com>
(see commits)