Skip to content

Conversation

rgo3
Copy link
Contributor

@rgo3 rgo3 commented Jan 9, 2024

This commit fixes a bug where maybeUnloadObsoleteXDPPrograms() removes XDP programs after a restart that are still in use and should remain in place. This can cause intermittent connectivity issues.

The issue is that the kernel returns different values when it is queried for the attach mode of a netlink device compared to the values used when configuring the attach mode. E.g. when attaching with XDPDriverMode which evaluates to '4' in the input flags, querying it will return XDP_ATTACHED_DRV which evaluates to '1'. So when maybeUnloadObsoleteXDPPrograms() compares the queried values to the used input values there can be a mismatch which leads to cilium removing still needed XDP programs.

This commit also changes the test, to be a suitable regression test for this fix. Previously the test was using XDPGenericMode. Unfortunately in this case the returned value from the kernel when querying netlink devices is XDP_ATTACHED_SKB, and both constants evaluate to '2' which is why this bug wasn't caught by the test in the first place.

Fixes: #30132

This commit fixes a bug where `maybeUnloadObsoleteXDPPrograms()` removes
XDP programs after a restart that are still in use and should remain in
place. This can cause intermittent connectivity issues.

The issue is that the kernel returns different values when it is queried
for the attach mode of a netlink device compared to the values used when
configuring the attach mode. E.g. when attaching with `XDPDriverMode`
which evaluates to '4' in the input flags, querying it will return
`XDP_ATTACHED_DRV` which evaluates to '1'. So when
`maybeUnloadObsoleteXDPPrograms()` compares the queried values to the
used input values there can be a mismatch which leads to cilium removing
still needed XDP programs.

This commit also changes the test, to be a suitable regression test for
this fix. Previously the test was using `XDPGenericMode`. Unfortunately
in this case the returned value from the kernel when querying netlink
devices is `XDP_ATTACHED_SKB`, and both constants evaluate to '2' which
is why this bug wasn't caught by the test in the first place.

Fixes: cilium#30132

Signed-off-by: Robin Gögge <r.goegge@isovalent.com>
@rgo3 rgo3 requested a review from a team as a code owner January 9, 2024 20:39
@rgo3 rgo3 requested a review from dylandreimerink January 9, 2024 20:39
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 9, 2024
@rgo3 rgo3 added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.13 labels Jan 9, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 9, 2024
@rgo3
Copy link
Contributor Author

rgo3 commented Jan 9, 2024

/test

@julianwiedmann julianwiedmann added area/loader Impacts the loading of BPF programs into the kernel. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. kind/bug This is a bug in the Cilium logic. labels Jan 10, 2024
@oblazek
Copy link
Contributor

oblazek commented Jan 10, 2024

this indeed fixes my issue :)

@oblazek
Copy link
Contributor

oblazek commented Jan 10, 2024

also verified here with these changes #30114

@dylandreimerink
Copy link
Member

CI is green except for Gink-go conformance, which is hitting flake #30101. Marking ready-to-merge

@dylandreimerink dylandreimerink added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 10, 2024
@dylandreimerink dylandreimerink merged commit ea7c375 into cilium:main Jan 10, 2024
@ti-mo ti-mo deleted the fix-obsolete-xdp-removal branch January 10, 2024 15:59
@rgo3 rgo3 added the backport/author The backport will be carried out by the author of the PR. label Jan 10, 2024
@rgo3 rgo3 added backport-pending/1.13 backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Jan 12, 2024
@gentoo-root gentoo-root added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.13 backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jan 17, 2024
borkmann added a commit that referenced this pull request Feb 12, 2024
The test is currently causing a lot of flakes, so it looks like we
need to revisit #30163 to figure out why it is still not addressed.

Related: #30114
Closes: #30707
Closes: #24728
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
borkmann added a commit that referenced this pull request Feb 12, 2024
The test is currently causing a lot of flakes, so it looks like we
need to revisit #30163 to figure out why it is still not addressed.

Related: #30114
Closes: #30707
Closes: #24728
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
dlapcevic pushed a commit to dlapcevic/cilium that referenced this pull request Feb 13, 2024
The test is currently causing a lot of flakes, so it looks like we
need to revisit cilium#30163 to figure out why it is still not addressed.

Related: cilium#30114
Closes: cilium#30707
Closes: cilium#24728
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pionerd pushed a commit to Pionerd/cilium that referenced this pull request Feb 13, 2024
The test is currently causing a lot of flakes, so it looks like we
need to revisit cilium#30163 to figure out why it is still not addressed.

Related: cilium#30114
Closes: cilium#30707
Closes: cilium#24728
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/loader Impacts the loading of BPF programs into the kernel. backport/author The backport will be carried out by the author of the PR. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/bug This is a bug in the Cilium logic. kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cilium detaches still in use XDP programs on restart
5 participants