Skip to content

Conversation

gandro
Copy link
Member

@gandro gandro commented Oct 30, 2024

Once this PR is merged, a GitHub action will update the labels of these PRs:

 35614

gandro added 10 commits October 30, 2024 16:08
[ upstream commit 5d1951b ]

According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see
e.g. #32099), the logged error message can still cause CI
to fail (e.g. #35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit abbe2b8 ]

See commit `netlink: Add wrapper for functions that fail with
`ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 428d2c4 ]

See commit `netlink: Add wrapper for functions that fail with
`ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 7facbaf ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit dec70b4 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit c271659 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 91659fd ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit acf9e04 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit e8c4d45 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit ef780e2 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro added kind/backports This PR provides functionality previously merged into master. backport/1.16 This PR represents a backport for Cilium 1.16.x of a PR that was merged to main. labels Oct 30, 2024
@gandro gandro marked this pull request as ready for review October 30, 2024 15:25
@gandro gandro requested a review from a team as a code owner October 30, 2024 15:25
Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌯

[ upstream commit d7668b0 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 5d38674 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

In addition, this commit also removes some existing retry mechanisms with the
canoncial safenetlink version.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit caf6088 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit c0c3244 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 04b6de8 ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit f8cb55d ]

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

This commit is a bit different from the previous ones as it uses the
retry helper rather than a wrapper.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 3824b6b ]

This adds a new check script to ensure any functions which has a safe
wrapper in the `safenetlink` package uses the safe variant.

See commit `netlink: Add wrapper for functions that fail with
ErrDumpInterrupted` for details.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/v1.16-backport-2024-10-30-04-08 branch from 0fedbd7 to 0037c95 Compare October 30, 2024 15:32
@gandro
Copy link
Member Author

gandro commented Oct 30, 2024

/test-backport-1.16

@gandro gandro added this pull request to the merge queue Oct 31, 2024
Merged via the queue into v1.16 with commit 68a7b74 Oct 31, 2024
278 of 279 checks passed
@gandro gandro deleted the pr/v1.16-backport-2024-10-30-04-08 branch October 31, 2024 10:04
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Oct 31, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.16 This PR represents a backport for Cilium 1.16.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants