ipam: retry netlink.LinkList call when setting up ENI devices #32099

jasonaliyetti · 2024-04-19T23:11:49Z

`netlink.LinkList` is prone to interruptions, which the netlink library surfaces as errors. This leads to stability issues when using the ENI datapath. This change moves the `LinkList` call into the retry loop in `waitForNetlinkDevices`, so an interrupted call is retried instead of failing the ENI setup.

Fixes: #31974

LinkList is prone to interrupts which are surfaced by the netlink library. This leads to stability issues when using the ENI datapath. This change makes it part of the retry loop in waitForNetlinkDevices.

Fixes: cilium#31974

Signed-off-by: Jason Aliyetti <jaliyetti@gmail.com>

gandro

Thanks!

gandro · 2024-04-22T09:09:54Z

/test

gandro · 2024-04-22T14:34:15Z

Congratulations on your first contribution! 🚀

jasonaliyetti · 2024-04-22T15:00:19Z

> Congratulations on your first contribution! 🚀

Thanks! Anything I need to do for the backports or is that automated?

gandro · 2024-04-22T15:21:57Z

> > Congratulations on your first contribution! 🚀
>
> Thanks! Anything I need to do for the backports or is that automated?

Backports should be handled by the backporter team, so there is nothing to do from your side. If they run into any issues, they will let you know in this PR.

According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library has returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error.

While many call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: for example, if the routing setup fails due to `NLM_F_DUMP_INTR` during a CNI ADD invocation, the retry adds several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

[ upstream commit 5d1951b ]

According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library has returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error.

While many call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: for example, if the routing setup fails due to `NLM_F_DUMP_INTR` during a CNI ADD invocation, the retry adds several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

jasonaliyetti requested a review from a team as a code owner April 19, 2024 23:11

jasonaliyetti requested a review from gandro April 19, 2024 23:11

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 19, 2024

github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Apr 19, 2024

gandro approved these changes Apr 22, 2024

gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. area/eni Impacts ENI based IPAM. needs-backport/1.13 labels Apr 22, 2024

maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Apr 22, 2024

jasonaliyetti mentioned this pull request Apr 22, 2024

CI: Conformance EKS: Installation and Connectivity Test (1.24, ca-west-1): [check-log-errors]: failed to obtain eni link list: interrupted system call #30990

Closed

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 22, 2024

gandro added this pull request to the merge queue Apr 22, 2024

Merged via the queue into cilium:main with commit cf9bde5 Apr 22, 2024

jasonaliyetti deleted the cilium-31974 branch April 22, 2024 15:48

thejar mentioned this pull request Apr 22, 2024

ENI setup issues in 1.15.3 on EKS using ENI datapath #31974

Closed

3 tasks

gandro mentioned this pull request Apr 29, 2024

v1.15 Backports 2024-04-29 #32230

Merged

18 tasks

gandro added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 labels Apr 29, 2024

gandro mentioned this pull request Apr 30, 2024

v1.14 Backports 2024-04-30 #32251

Merged

13 tasks

gandro added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 labels Apr 30, 2024

gandro mentioned this pull request Apr 30, 2024

v1.13 Backports 2024-04-30 #32252

Merged

10 tasks

gandro added backport-pending/1.13 and removed needs-backport/1.13 labels Apr 30, 2024

aanm mentioned this pull request May 2, 2024

Prepare for release v1.14.11 aanm/cilium#643

Merged

This was referenced May 10, 2024

Prepare for release v1.13.16 #32458

Merged

Prepare for release v1.14.11 #32460

Merged

Prepare for release v1.15.5 #32470

Merged

mhofstetter mentioned this pull request May 17, 2024

ipam: lower loglevel from error to warn if eni link list can't be listed #32602

Merged

jasonaliyetti mentioned this pull request Jul 1, 2024

IPsec errors in 1.15.6 #33507

Closed

3 tasks

gandro mentioned this pull request Oct 29, 2024

treewide: Add wrapper for netlink functions that may fail with ErrDumpInterrupted #35614

Merged
