
Conversation

borkmann
Member

@borkmann borkmann commented Jun 9, 2020

Related: #11175

@borkmann borkmann requested review from pchaigno and a team June 9, 2020 09:30
@borkmann borkmann requested review from a team as code owners June 9, 2020 09:30
@borkmann borkmann requested a review from a team June 9, 2020 09:30
@maintainer-s-little-helper

Please set the appropriate release note label.

@borkmann borkmann added the release-note/misc (This PR makes changes that have no direct user impact) label Jun 9, 2020
@borkmann borkmann requested a review from brb June 9, 2020 09:31
@borkmann borkmann requested a review from a team as a code owner June 9, 2020 09:36
@borkmann
Member Author

borkmann commented Jun 9, 2020

retest-4.19

@borkmann borkmann force-pushed the pr/fix-4.19-verifier branch from 76f70a3 to c4268eb on June 9, 2020 10:01
@borkmann borkmann requested review from a team as code owners June 9, 2020 10:01
@borkmann borkmann requested a review from a team June 9, 2020 10:01
@coveralls

coveralls commented Jun 9, 2020

Coverage Status

Coverage decreased (-0.02%) to 37.024% when pulling 5e94d68 on pr/fix-4.19-verifier into e5bc626 on master.

@borkmann borkmann added the pending-review, area/CI (Continuous Integration testing issue or flake) and area/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages) labels Jun 9, 2020
@borkmann
Member Author

borkmann commented Jun 9, 2020

retest-4.19

@borkmann borkmann force-pushed the pr/fix-4.19-verifier branch 2 times, most recently from 987addf to 2191734 on June 9, 2020 11:16
@borkmann
Member Author

borkmann commented Jun 9, 2020

retest-4.19

@maintainer-s-little-helper

Commit a649270bc354037101a54c48eb87e5bfb076efbd does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off (The author needs to add signoff to their commits before merge) label Jun 9, 2020
borkmann and others added 6 commits June 12, 2020 13:45
In particular, running CI with datapath debugging enabled can be harmful
since i) it does not represent what is running in a production environment,
and ii) it can hide verifier complexity or instruction size bugs for non-debug
mode which are significantly harder to pinpoint and fix 'after the fact'.

Thus, run what would be loaded in a production environment in terms of
BPF code and make that the first-class citizen in CI, not the other way
round. Datapath debugging can still be enabled via CLI through:

  cilium config Debug=Enabled

In the worst case, we can add a new flag to control it via param on agent
start, but given this is mainly just for manual debugging, I left that
out for now.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
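
A minimal sketch of toggling this at runtime, assuming the agent runs as the
usual cilium DaemonSet in kube-system (pod selection is environment-specific):

  # Enable BPF datapath debugging on an agent for manual troubleshooting ...
  kubectl -n kube-system exec ds/cilium -- cilium config Debug=Enabled
  # ... and switch back to the production-like default once done.
  kubectl -n kube-system exec ds/cilium -- cilium config Debug=Disabled
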
Add a small helper that we can use in our CI code.

Signed-off-by: Paul Chaignon <paul@cilium.io>
The monitor tests are the only place where we check for DEBUG presence in
the monitor output. Enable BPF datapath debugging there.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
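
As a hedged illustration of the check this refers to: with datapath debugging
enabled, DEBUG events appear in the monitor stream and can be grepped for
(the exact invocation used by the CI helpers may differ):

  # Stream monitor events from an agent for a short while and look for
  # datapath DEBUG messages.
  timeout 10 kubectl -n kube-system exec ds/cilium -- cilium monitor -v | grep DEBUG
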
Before this PR, the 4.19 CI pipeline was running probe:

  KubeProxyReplacement: Probe () [HostReachableServices (TCP, UDP), SessionAffinity]

After this PR, it is switched to strict:

  KubeProxyReplacement: Strict (enp0s8) [NodePort (SNAT, 30000-32767, XDP: DISABLED),
                        HostPort, ExternalIPs, HostReachableServices (TCP, UDP),
                        SessionAffinity]

It turns out that this breaks the up/downgrade test on the 1.7.4 image:

  [...]
  2020-06-10T07:22:18.490936835Z level=info msg="Cilium 1.7.4 8ccccd9 2020-06-03T14:23:45-07:00 go version go1.13.12 linux/amd64" subsys=daemon
  2020-06-10T07:22:18.491423561Z level=debug msg="Load 1-min: 0.16 5-min: 0.23 15min: 0.10" subsys=loadinfo type=background
  2020-06-10T07:22:18.491534313Z level=debug msg="Memory: Total: 7974 Used: 922 (11.56%) Free: 4905 Buffers: 82 Cached: 2064" subsys=loadinfo type=background
  2020-06-10T07:22:18.491684768Z level=debug msg="Swap: Total: 0 Used: 0 (0.00%) Free: 0" subsys=loadinfo type=background
  2020-06-10T07:22:18.511350077Z level=info msg="cilium-envoy  version: 63de0bd958d05d82e2396125dcf6286d92464c56/1.12.3/Modified/RELEASE/BoringSSL" subsys=daemon
  2020-06-10T07:22:18.547219473Z level=info msg="clang (7.0.0) and kernel (4.19.57) versions: OK!" subsys=linux-datapath
  2020-06-10T07:22:18.556721593Z level=info msg="linking environment: OK!" subsys=linux-datapath
  [...]
  2020-06-10T07:22:26.088381054Z level=warning msg="+ tc filter replace dev cilium_vxlan ingress prio 1 handle 1 bpf da obj bpf_overlay.o sec from-overlay" subsys=datapath-loader
  2020-06-10T07:22:26.088412542Z level=warning subsys=datapath-loader
  2020-06-10T07:22:26.08847343Z level=warning msg="Prog section '2/19' rejected: Argument list too long (7)!" subsys=datapath-loader
  2020-06-10T07:22:26.088477964Z level=warning msg=" - Type:         3" subsys=datapath-loader
  2020-06-10T07:22:26.088480166Z level=warning msg=" - Attach Type:  0" subsys=datapath-loader
  2020-06-10T07:22:26.088520672Z level=warning msg=" - Instructions: 4336 (240 over limit)" subsys=datapath-loader
  2020-06-10T07:22:26.088524801Z level=warning msg=" - License:      GPL" subsys=datapath-loader
  [...]

Therefore, as a workaround, remove the datapath BPF debugging output to fix
the complexity issue on 1.7.4. This needs separate addressing on 1.7 via #12018
if we want to bring debugging back.

Related: #12018
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
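
For context, kernels before 5.2 reject any single BPF program larger than
4096 instructions with E2BIG ("Argument list too long"), which is exactly the
"4336 (240 over limit)" failure quoted above. A hedged way to spot sections
close to that limit before the kernel sees them (object name as in the log;
each BPF instruction is 8 bytes, so Size/8 is the instruction count of a
tail-call section):

  # List ELF sections of the compiled overlay object; tail calls live in
  # sections named like '2/19', and their Size (hex) divided by 8 is the
  # number of instructions handed to the verifier.
  llvm-objdump --section-headers bpf_overlay.o
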
Seen this brokenness from a Cilium upgrade test run:

  kubectl delete daemonset --namespace=kube-system cilium
  kubectl delete deployment --namespace=kube-system cilium-operator
  kubectl delete podsecuritypolicy cilium-psp cilium-operator-psp
  kubectl delete daemonset --namespace=kube-system cilium-node-init
  kubectl get pods --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  kubectl get pods -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'
  [...]
  kubectl get pods -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n -n --all-namespaces -o jsonpath='{.items[*].metadata.deletionTimestamp}'

This caused the test suite to wait for an invalid namespace which never
came up. Fix it by moving ns into a tmp var.

Fixes: 82df172 ("test: Add single node conformance test")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
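
A hedged reconstruction of the failure mode (function and variable names below
are illustrative, not the actual test helpers): the namespace flag was appended
to a variable that survives across retries of the wait loop, so every iteration
added another bare -n. Rebuilding the flag in a loop-local temporary avoids the
accumulation:

  # Buggy pattern: ns_flags is shared and only ever appended to.
  ns_flags=""
  wait_pods_terminated_buggy() {
    until [ -z "$(kubectl get pods $ns_flags --all-namespaces \
          -o jsonpath='{.items[*].metadata.deletionTimestamp}')" ]; do
      ns_flags="$ns_flags -n $1"   # with an empty $1 this degenerates to "-n -n -n ..."
      sleep 1
    done
  }

  # Fix in the spirit of "move ns into a tmp var": derive the flag from the
  # input once, in a local, so retries never mutate it.
  wait_pods_terminated_fixed() {
    local ns_flags="--all-namespaces"
    [ -n "$1" ] && ns_flags="-n $1"
    until [ -z "$(kubectl get pods $ns_flags \
          -o jsonpath='{.items[*].metadata.deletionTimestamp}')" ]; do
      sleep 1
    done
  }
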
The host endpoint introduces some interesting interactions between the
API, the host endpoint datapath configuration management and the
recently proposed autodisabling of debug mode for the host endpoint.
These issues are tracked in more detail here:

#12037

For now, skip the host endpoint when iterating through endpoints in
runtime tests to simplify working around this issue; coverage is
otherwise the same as for previous releases. We can figure out the right
long-term handling of the host endpoint in RuntimeMonitor tests via the
above issue and unblock PR #11977 with this commit.

Signed-off-by: Joe Stringer <joe@cilium.io>
@borkmann borkmann force-pushed the pr/fix-4.19-verifier branch from 887ab24 to 5e94d68 on June 12, 2020 11:47
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off (The author needs to add signoff to their commits before merge) label Jun 12, 2020
@borkmann
Member Author

test-me-please

@borkmann
Member Author

Turns out the cherry-pick of 26a39a8 was triggering the remaining up/downgrade cleanup connection disruption in CI. With 8adf6fe in place and the cherry-pick taken out of the series, the test-focus run went through just fine on net-next. I tested 4.19.57 with kube-proxy-replacement=strict and v4+v6 locally on my machine and it loads just fine with the rest of the series. So I've taken out the mentioned commit and repushed everything, and also took out the debug=false change from Joe since it's not necessary. There are no other known issues for now on the rest of the series, so once this hits green, we can move forward and merge it in.

To avoid long waiting times on this one, I opened a new PR #12045 where I re-enable the kube-proxy-replacement probing on 4.19, given that the combination the commit tried to fix (modulo debug) works fine locally. If that goes well, we get the full kube-proxy-free coverage on 4.19 and can fix up BPF datapath debug as needed later on. Given it is highly likely that no one is running the BPF debug mode in production/staging, we can just move on and fix it up later without it being a release blocker.

@pchaigno
Member

Turns out the cherry-pick of 26a39a8 was triggering the remaining up/downgrade cleanup connection disruption in CI.

That was my first impression as well (and the reason I tried reordering the tail call numbers). I'm wondering if we have some issue when swapping the programs in the prog array 🤔 If I remember correctly, tc doesn't flush the array; it just updates entries...

Could you remove bpf: Enforce host policies before BPF-based SNAT as well if it's not needed? I'd rather not merge a partial fix for that if possible, as it makes things worse until we get a full fix in.

How is the program size issue on CILIUM_CALL_NODEPORT_REVNAT addressed if 26a39a8 is dropped? I thought that issue was happening even with debug=false (it's the original issue reported in #11175).
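
One hedged way to check the prog-array suspicion above on a node (the pin path
and map name below are examples; the actual cilium_calls_* name depends on the
program/endpoint):

  # Dump a pinned Cilium tail-call prog array: each present key is a tail-call
  # index and the value is the program id installed in that slot; a slot the
  # new datapath jumps to but nothing repopulated after the swap shows up as a
  # missing key.
  bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_calls_overlay_2
  # Cross-check which loaded program a given id corresponds to.
  bpftool prog show
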

@borkmann
Member Author

borkmann commented Jun 12, 2020

Turns out the cherry-pick of 26a39a8 was triggering the remaining up/downgrade cleanup connection disruption in CI.

That was my first impression as well (and the reason I tried reordering the tail call numbers). I'm wondering if we have some issue when swapping the programs in the prog array 🤔 If I remember correctly, tc doesn't flush the array; it just updates entries...

Currently debugging it; I was thinking about that as well. The path in the test is 1.7.4 -> latest (upgrade) -> 1.7.4 (downgrade) -> latest (general cleanup for the next test). If we don't use some tail call slots, we would drop due to a missing tail call. OTOH, the current config should load what it has enabled. Hmm, I'll debug further here.

Could you remove bpf: Enforce host policies before BPF-based SNAT as well if it's not needed? I'd rather not merge a partial fix for that if possible, as it makes things worse until we get a full fix in.

Hm, true. OTOH, the test suite currently passes, and the service tests on 4.19 with kube-proxy free and v4 (instead of v4+v6) enabled also pass right now. Maybe let's get this merged as-is right now to unblock, and hopefully by tonight I'll have the other one fixed so we can merge it as well. Either way, it's going to be fixed before the final 1.8.0.

@borkmann borkmann merged commit 7cfe6f2 into master Jun 12, 2020
@borkmann borkmann deleted the pr/fix-4.19-verifier branch June 12, 2020 13:43
joestringer added a commit that referenced this pull request Jun 12, 2020
[ upstream commit 7cfe6f2 ]

The host endpoint introduces some interesting interactions between the
API, the host endpoint datapath configuration management and the
recently proposed autodisabling of debug mode for the host endpoint.
These issues are tracked in more detail here:

#12037

For now, skip the host endpoint when iterating through endpoints in
runtime tests to simplify working around this issue; coverage is
otherwise the same as for previous releases. We can figure out the right
long-term handling of the host endpoint in RuntimeMonitor tests via the
above issue and unblock PR #11977 with this commit.

Signed-off-by: Joe Stringer <joe@cilium.io>
@borkmann
Member Author

Turns out the cherry-pick of 26a39a8 was triggering the remaining up/downgrade cleanup connection disruption in CI.

That was my first impression as well (and the reason I tried reordering the tail call numbers). I'm wondering if we have some issue when swapping the programs in the prog array 🤔 If I remember correctly, tc doesn't flush the array; it just updates entries...

Currently debugging it; I was thinking about that as well. The path in the test is 1.7.4 -> latest (upgrade) -> 1.7.4 (downgrade) -> latest (general cleanup for the next test). If we don't use some tail call slots, we would drop due to a missing tail call. OTOH, the current config should load what it has enabled. Hmm, I'll debug further here.

Found the issue: it's a drop due to a missed tail call, which then causes the connection disruption during up/downgrade. Will post the fix & the full rest of the series for #12045 on Mon.
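
For reference, a hedged way to observe that failure mode live while the
up/downgrade test runs (drop notifications carry the reason, "Missed tail
call" in this case):

  # Watch drop events from an agent during the up/downgrade window.
  kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop
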

aanm pushed a commit that referenced this pull request Jun 15, 2020
qmonnet added a commit that referenced this pull request Jun 17, 2020
The test for fragment tracking support got a fix with commit
0e772e7 ("test: Fix fragment tracking test under KUBEPROXY=1"),
where the pattern for searching entries in the Conntrack table accounts
for DNAT not happening if kube-proxy is present.

Following recent changes in the datapath and tests for the Linux 4.19
kernel, DNAT is now used even with kube-proxy, provided bpf_sock is in
use. This led to CI failures, and the test was disabled for 4.19 kernels
with commit 1120aed ("test/K8sServices: disable fragment tracking
test for kernel 4.19").

Now that complexity issues are fixed (see #11977 and #12045), let's
enable the test on 4.19 again. Ignore DNAT only if kube-proxy is present
and bpf_sock (host-reachable services) is not in use. This is also the
case for net-next kernels (this didn't fail in CI before because we do
not test with kube-proxy on net-next).

Note that (as far as I know) both 4.19 and net-next always use bpf_sock
in CI runs, so the check on hostReachableServices is currently
superfluous. Let's have it all the same, in case something changes in
the future, to avoid unexpected breakage.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
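
A hedged sketch of the condition described above, not the actual test helper
(variable names are illustrative):

  # Ignore DNAT in the expected conntrack pattern only when kube-proxy is
  # present AND bpf_sock (host-reachable services) is not in use; in every
  # other case DNAT is expected (done by bpf_sock or the BPF datapath).
  if [ "$RUN_WITH_KUBEPROXY" = "1" ] && [ "$HOST_REACHABLE_SERVICES" != "true" ]; then
    ignore_dnat=true
  else
    ignore_dnat=false
  fi
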
qmonnet added a commit that referenced this pull request Jun 18, 2020
aanm pushed a commit that referenced this pull request Jun 21, 2020
aanm pushed a commit that referenced this pull request Jun 21, 2020
[ upstream commit 5b9503f ]

The test for fragment tracking support got a fix with commit
0e772e7 ("test: Fix fragment tracking test under KUBEPROXY=1"),
where the pattern for searching entries in the Conntrack table accounts
for DNAT not happening if kube-proxy is present.

Following recent changes in the datapath and tests for the Linux 4.19
kernel, DNAT is now used even with kube-proxy, provided bpf_sock is in
use. This led to CI failures, and the test was disabled for 4.19 kernels
with commit 1120aed ("test/K8sServices: disable fragment tracking
test for kernel 4.19").

Now that complexity issues are fixed (see #11977 and #12045), let's
enable the test on 4.19 again. Ignore DNAT only if kube-proxy is present
and bpf_sock (host-reachable services) is not in use. This is also the
case for net-next kernels (this didn't fail in CI before because we do
not test with kube-proxy on net-next).

Note that (as far as I know) both 4.19 and net-next always use bpf_sock
in CI runs, so the check on hostReachableServices is currently
superfluous. Let's have it all the same, in case something changes in
the future, to avoid unexpected breakage.

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: André Martins <andre@cilium.io>
aanm pushed a commit that referenced this pull request Jun 22, 2020