jasonaliyetti
Contributor

@jasonaliyetti jasonaliyetti commented Apr 19, 2024

LinkList is prone to interrupts which are surfaced by the netlink library. This leads to stability issues when using the ENI datapath. This change makes it part of the retry loop in waitForNetlinkDevices.

Fixes: #31974
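
To illustrate the idea described above, here is a minimal sketch of a wait loop in which the link dump itself runs inside the retry loop, so an interrupted dump is treated like any other failed attempt. It is not the actual Cilium code: `waitForDevices`, `listLinksFunc`, `missingDevices`, and `errInterrupted` are hypothetical names, and the link dump is replaced by a plain function so the example runs without the netlink library.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
	"time"
)

// errInterrupted is an assumption for this sketch: it stands in for the
// interrupt error that a link dump may surface.
var errInterrupted error = syscall.EINTR

// listLinksFunc is a hypothetical stand-in for a link dump such as LinkList.
// It returns the names of the devices that are currently visible.
type listLinksFunc func() ([]string, error)

// missingDevices returns the entries of want that are not present in have.
func missingDevices(have, want []string) []string {
	seen := make(map[string]bool, len(have))
	for _, name := range have {
		seen[name] = true
	}
	var missing []string
	for _, name := range want {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

// waitForDevices calls list on every attempt. A failed dump (for example an
// interrupted one) and a dump that lacks some wanted device both lead to
// another attempt, up to the given number of attempts.
func waitForDevices(list listLinksFunc, want []string, attempts int, delay time.Duration) error {
	lastErr := errors.New("no attempts made")
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(delay)
		}
		links, err := list()
		if err != nil {
			lastErr = err
			continue
		}
		missing := missingDevices(links, want)
		if len(missing) == 0 {
			return nil
		}
		lastErr = fmt.Errorf("devices not found yet: %v", missing)
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, lastErr)
}
```

A caller would pass a closure that performs the real dump and maps the result to device names. The final error wraps the last failure, so `errors.Is` can still detect an interrupt after the attempts are exhausted.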

LinkList is prone to interrupts which are surfaced by the netlink library.  This leads to stability issues when using the ENI datapath.  This change makes it part of the retry loop in waitForNetlinkDevices.

Fixes: cilium#31974
Signed-off-by: Jason Aliyetti <jaliyetti@gmail.com>
@jasonaliyetti jasonaliyetti requested a review from a team as a code owner April 19, 2024 23:11
@jasonaliyetti jasonaliyetti requested a review from gandro April 19, 2024 23:11
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 19, 2024
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Apr 19, 2024
Member

@gandro gandro left a comment

Thanks!

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. area/eni Impacts ENI based IPAM. needs-backport/1.13 labels Apr 22, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Apr 22, 2024
@gandro
Member

gandro commented Apr 22, 2024

/test

@gandro
Member

gandro commented Apr 22, 2024

Congratulations on your first contribution! 🚀

Merged via the queue into cilium:main with commit cf9bde5 Apr 22, 2024
@jasonaliyetti
Contributor Author

> Congratulations on your first contribution! 🚀

Thanks! Anything I need to do for the backports or is that automated?

@gandro
Member

gandro commented Apr 22, 2024

> > Congratulations on your first contribution! 🚀
>
> Thanks! Anything I need to do for the backports or is that automated?

Backports are handled by the backporter team, so there is nothing to do from your side. If they run into any issues, they will let you know in this PR.

@jasonaliyetti jasonaliyetti deleted the cilium-31974 branch April 22, 2024 15:48
@gandro gandro mentioned this pull request Apr 29, 2024
18 tasks
@gandro gandro added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 labels Apr 29, 2024
@gandro gandro mentioned this pull request Apr 30, 2024
13 tasks
@gandro gandro added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 labels Apr 30, 2024
@gandro gandro mentioned this pull request Apr 30, 2024
10 tasks
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-pending/1.13 backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels May 2, 2024
@jasonaliyetti jasonaliyetti mentioned this pull request Jul 1, 2024
3 tasks
gandro added a commit to gandro/cilium that referenced this pull request Oct 29, 2024
According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While many call sites do have their own high-level retry mechanism (see
e.g. cilium#32099), the logged error message can still cause CI
to fail (e.g. cilium#35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during a CNI ADD invocation, the retry adds
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
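
The retry behavior described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the `safenetlink` package itself: `withRetryResult`, `maxAttempts`, and the local `errDumpInterrupted` sentinel (standing in for `netlink.ErrDumpInterrupted`) are hypothetical names, and the sketch retries immediately because the message does not say whether the real wrapper waits between attempts.

```go
package main

import "errors"

// errDumpInterrupted is a local stand-in (an assumption of this sketch) for
// the error a netlink library reports when a dump was marked NLM_F_DUMP_INTR.
var errDumpInterrupted = errors.New("dump interrupted, results may be incomplete")

// maxAttempts mirrors the "up to 30 times" limit mentioned in the commit message.
const maxAttempts = 30

// withRetryResult calls fn until it succeeds, returns an error other than
// errDumpInterrupted, or maxAttempts calls have been made. The result and
// error of the last call are returned.
func withRetryResult[T any](fn func() (T, error)) (T, error) {
	var out T
	var err error
	for i := 0; i < maxAttempts; i++ {
		out, err = fn()
		if !errors.Is(err, errDumpInterrupted) {
			// Either success (err == nil) or a different error: stop retrying.
			return out, err
		}
	}
	return out, err
}
```

A call site would wrap the dump in a closure, for example `withRetryResult(func() ([]string, error) { return listDevices() })`, where `listDevices` is again a hypothetical helper. Keeping the retry in one generic wrapper is what lets individual subsystems avoid their own retry loops.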
gandro added a commit to gandro/cilium that referenced this pull request Oct 30, 2024
github-merge-queue bot pushed a commit that referenced this pull request Oct 30, 2024
gandro added a commit that referenced this pull request Oct 30, 2024
[ upstream commit 5d1951b ]
github-merge-queue bot pushed a commit that referenced this pull request Oct 31, 2024
[ upstream commit 5d1951b ]
Labels
area/eni Impacts ENI based IPAM. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
Status: Released
Development

Successfully merging this pull request may close these issues.

ENI setup issues in 1.15.3 on EKS using ENI datapath
2 participants