Skip to content

Conversation

gandro
Copy link
Member

@gandro gandro commented Oct 29, 2024

According to the kernel docs1, the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by NLM_F_DUMP_INTR. The vishvananda/netlink library returned EINTR since v1.2.1, but more recent versions have changed it such that it returns netlink.ErrDumpInterrupted instead2.

These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all netlink functions marked to return ErrDumpInterrupted that retries the function up to 30 times until it either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to NLM_F_DUMP_INTR during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly.

The last commit adds an additional linter that nudges developers to use this new safenetlink package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually.

Review per commit

Footnotes

  1. https://docs.kernel.org/userspace-api/netlink/intro.html

  2. https://github.com/vishvananda/netlink/pull/1018

@gandro gandro added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. backport/author The backport will be carried out by the author of the PR. affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Oct 29, 2024
@gandro gandro force-pushed the pr/gandro/netlink-retries branch 2 times, most recently from e4f6449 to ad479a5 Compare October 29, 2024 14:23
@gandro gandro marked this pull request as ready for review October 29, 2024 14:29
@gandro gandro requested review from a team as code owners October 29, 2024 14:29
See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

In addition, this commit also removes some existing retry mechanisms with the
canoncial safenetlink version.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

This commit is a bit different from the previous ones as it uses the
retry helper rather than a wrapper.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This adds a new check script to ensure any functions which has a safe
wrapper in the `safenetlink` package uses the safe variant.

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/netlink-retries branch from ceddefd to feff2df Compare October 30, 2024 11:09
@gandro
Copy link
Member Author

gandro commented Oct 30, 2024

Did you also look for usages of the netlink/nl pkg?

I only found one usage which might be buggy:

switch m.Header.Type {
case unix.NLMSG_DONE:
break loop

All other users are only importing types, but are not actually invoking netlink requests.

I believe that function should check for NLM_F_DUMP_INTR, but I don't know the code well enough to implement that. In particular, I'm also not sure if all that code could be replaced by safenetlink.SocketDiagUDP*. cc @borkmann

@gandro
Copy link
Member Author

gandro commented Oct 30, 2024

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 30, 2024
@gandro gandro added this pull request to the merge queue Oct 30, 2024
Merged via the queue into cilium:main with commit 3824b6b Oct 30, 2024
64 checks passed
@gandro gandro deleted the pr/gandro/netlink-retries branch October 30, 2024 15:05
@gandro gandro mentioned this pull request Oct 30, 2024
1 task
@gandro gandro added backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. and removed needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Oct 30, 2024
@github-actions github-actions bot added backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. and removed backport-pending/1.16 The backport for Cilium 1.16.x for this PR is in progress. labels Oct 31, 2024
gandro added a commit to gandro/cilium that referenced this pull request Jun 25, 2025
This ensures that safenetlink package is used across the whole tree, in
particular also in unit tests (where EINTR can cause flaky tests) and
some invocations via handle which were previously missed (see
cilium#35614 for context why safenetlink is required).

This prepares the tree for the use of the forbidigo linter via golangci.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro added a commit to gandro/cilium that referenced this pull request Jun 25, 2025
This ensures that safenetlink package is used across the whole tree, in
particular also in unit tests (where EINTR can cause flaky tests) and
some invocations via handle which were previously missed (see
cilium#35614 for context why safenetlink is required).

This prepares the tree for the use of the forbidigo linter via golangci.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro added a commit to gandro/cilium that referenced this pull request Jun 25, 2025
This ensures that safenetlink package is used across the whole tree, in
particular also in unit tests (where EINTR can cause flaky tests) and
some invocations via handle which were previously missed (see
cilium#35614 for context why safenetlink is required).

This prepares the tree for the use of the forbidigo linter via golangci.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro added a commit to gandro/cilium that referenced this pull request Jun 30, 2025
This ensures that safenetlink package is used across the whole tree, in
particular also in unit tests (where EINTR can cause flaky tests) and
some invocations via handle which were previously missed (see
cilium#35614 for context why safenetlink is required).

This prepares the tree for the use of the forbidigo linter via golangci.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Jun 30, 2025
This ensures that safenetlink package is used across the whole tree, in
particular also in unit tests (where EINTR can cause flaky tests) and
some invocations via handle which were previously missed (see
#35614 for context why safenetlink is required).

This prepares the tree for the use of the forbidigo linter via golangci.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
rabelmervin pushed a commit to rabelmervin/cilium that referenced this pull request Aug 18, 2025
This ensures that safenetlink package is used across the whole tree, in
particular also in unit tests (where EINTR can cause flaky tests) and
some invocations via handle which were previously missed (see
cilium#35614 for context why safenetlink is required).

This prepares the tree for the use of the forbidigo linter via golangci.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.14 This issue affects v1.14 branch affects/v1.15 This issue affects v1.15 branch area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. backport/author The backport will be carried out by the author of the PR. backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants