Skip to content

datapath: Retry netlink.LinkList when interrupted #35259

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented Oct 7, 2024

Retry netlink.LinkList call if it returns EINTR error, due to an upstream netlink library change to fail if NLM_F_DUMP_INTR flag is set in the netlink response (vishvananda/netlink#925).

After we get our netlink dependency updated to include vishvananda/netlink#1018, we can change this retry condition to look for netlink.ErrDumpInterrupted instead.

This fixes this failure mode seen frequently in the CI recently:

time="2024-09-26T13:15:45.933969654Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: rendering vlan filter macros: listing network interfaces: interrupted system call" subsys=datapath-loader ...
time="2024-09-26T13:15:46.100906287Z" level=error msg="Failed to compile bpf_lxc.o: exit status 1" compiler-pid=793 subsys=datapath-loader time="2024-09-26T13:15:46.100944558Z" level=warning msg="In file included from /var/lib/cilium/bpf/bpf_lxc.c:18:" subsys=datapath-loader time="2024-09-26T13:15:46.100961449Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/auth.h:6:" subsys=datapath-loader time="2024-09-26T13:15:46.100969916Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/common.h:1279:" subsys=datapath-loader time="2024-09-26T13:15:46.100978361Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/overloadable.h:10:" subsys=datapath-loader time="2024-09-26T13:15:46.100987608Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/overloadable_skb.h:8:" subsys=datapath-loader time="2024-09-26T13:15:46.100997066Z" level=warning msg="/var/lib/cilium/bpf/lib/clustermesh.h:18:23: error: use of undeclared identifier '__config_identity_length'" subsys=datapath-loader
time="2024-09-26T13:15:46.101011563Z" level=warning msg=" 18 | __u32 identity_len = CONFIG(identity_length);" subsys=datapath-loader
time="2024-09-26T13:15:46.101018676Z" level=warning msg=" | ^" subsys=datapath-loader
time="2024-09-26T13:15:46.101138079Z" level=warning msg="/var/lib/cilium/bpf/lib/static_data.h:75:10: note: expanded from macro 'CONFIG'" subsys=datapath-loader

Fixes: #35055

Cilium no longer fails compiling bpf programs if listing network links is interrupted.

Retry netlink.LinkList call if it returns EINTR error. This should fix
this failure mode seen frequently in the CI:

time="2024-09-26T13:15:45.933969654Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: rendering vlan filter macros: listing network interfaces: interrupted system call" subsys=datapath-loader
...
time="2024-09-26T13:15:46.100906287Z" level=error msg="Failed to compile bpf_lxc.o: exit status 1" compiler-pid=793 subsys=datapath-loader
time="2024-09-26T13:15:46.100944558Z" level=warning msg="In file included from /var/lib/cilium/bpf/bpf_lxc.c:18:" subsys=datapath-loader
time="2024-09-26T13:15:46.100961449Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/auth.h:6:" subsys=datapath-loader
time="2024-09-26T13:15:46.100969916Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/common.h:1279:" subsys=datapath-loader
time="2024-09-26T13:15:46.100978361Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/overloadable.h:10:" subsys=datapath-loader
time="2024-09-26T13:15:46.100987608Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/overloadable_skb.h:8:" subsys=datapath-loader
time="2024-09-26T13:15:46.100997066Z" level=warning msg="/var/lib/cilium/bpf/lib/clustermesh.h:18:23: error: use of undeclared identifier '__config_identity_length'" subsys=datapath-loader
time="2024-09-26T13:15:46.101011563Z" level=warning msg="   18 |         __u32 identity_len = CONFIG(identity_length);" subsys=datapath-loader
time="2024-09-26T13:15:46.101018676Z" level=warning msg="      |                              ^" subsys=datapath-loader
time="2024-09-26T13:15:46.101138079Z" level=warning msg="/var/lib/cilium/bpf/lib/static_data.h:75:10: note: expanded from macro 'CONFIG'" subsys=datapath-loader

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme added kind/bug This is a bug in the Cilium logic. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. labels Oct 7, 2024
@jrajahalme jrajahalme requested a review from a team as a code owner October 7, 2024 08:10
@jrajahalme jrajahalme requested a review from rgo3 October 7, 2024 08:10
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Oct 7, 2024
@jrajahalme
Copy link
Member Author

/test

@jrajahalme jrajahalme added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Oct 7, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 7, 2024
@jrajahalme jrajahalme requested review from rgo3 and ti-mo October 8, 2024 19:50
@jrajahalme
Copy link
Member Author

Maybe listing network interfaces is kind of susceptible for NLM_F_DUMP_INTR given that busy nodes have new veths coming and going very frequently?

@jrajahalme jrajahalme changed the title datapath: Retry netlink.LinkList on EINTR datapath: Retry netlink.LinkList when interrupted Oct 13, 2024
@jrajahalme jrajahalme added needs-backport/1.14 needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch release-blocker/1.17 This issue will prevent the release of the next version of Cilium. and removed needs-backport/1.14 needs-backport/1.16 This PR / issue needs backporting to the v1.16 branch labels Oct 13, 2024
@jrajahalme jrajahalme enabled auto-merge October 15, 2024 17:40
@jrajahalme
Copy link
Member Author

@ti-mo Could you consider accepting this or proposing another fix soonish? I keep hitting this in my PRs pretty consistently (e.g. https://github.com/cilium/cilium/actions/runs/11418520360/attempts/2)

@jrajahalme jrajahalme requested review from rgo3 and removed request for rgo3 October 20, 2024 21:08
@jrajahalme jrajahalme added this pull request to the merge queue Oct 20, 2024
Merged via the queue into cilium:main with commit 934a6d8 Oct 20, 2024
75 checks passed
@jrajahalme jrajahalme deleted the datapath-retry-linklist-on-interrupted-system-call branch October 20, 2024 21:17
gandro added a commit to gandro/cilium that referenced this pull request Oct 29, 2024
According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see
e.g. cilium#32099), the logged error message can still cause CI
to fail (e.g. cilium#35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro added a commit to gandro/cilium that referenced this pull request Oct 29, 2024
According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see
e.g. cilium#32099), the logged error message can still cause CI
to fail (e.g. cilium#35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro added a commit to gandro/cilium that referenced this pull request Oct 30, 2024
According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see
e.g. cilium#32099), the logged error message can still cause CI
to fail (e.g. cilium#35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro added a commit to gandro/cilium that referenced this pull request Oct 30, 2024
According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see
e.g. cilium#32099), the logged error message can still cause CI
to fail (e.g. cilium#35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro added a commit to gandro/cilium that referenced this pull request Oct 30, 2024
According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see
e.g. cilium#32099), the logged error message can still cause CI
to fail (e.g. cilium#35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Oct 30, 2024
According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see
e.g. #32099), the logged error message can still cause CI
to fail (e.g. #35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro added a commit that referenced this pull request Oct 30, 2024
[ upstream commit 5d1951b ]

According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see
e.g. #32099), the logged error message can still cause CI
to fail (e.g. #35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Oct 31, 2024
[ upstream commit 5d1951b ]

According to the kernel docs[^1], the kernel can return incomplete
results for netlink state dumps if the state changes while we are
dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The
`vishvananda/netlink` library returned `EINTR` since v1.2.1, but more
recent versions have changed it such that it returns
`netlink.ErrDumpInterrupted` instead[^2].

These interruptions seem common in high-churn environments. If the error
occurs, it is in most cases best to just try again.  Therefore, this
commit adds a wrapper for all `netlink` functions marked to return
`ErrDumpInterrupted` that retries the function up to 30 times until it
either succeeds or returns a different error.

While may call sites do have their own high-level retry mechanism (see
e.g. #32099), the logged error message can still cause CI
to fail (e.g. #35259). Long high-level retry intervals can
also become problematic: For example, if the routing setup fails due to
`NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add
several seconds of additional delay to an already overloaded system,
instead of resolving the issue quickly.

A subsequent commit will add an additional linter that nudges developers
to use this new `safenetlink` package for function calls that can be
interrupted. This ensures that we don't have to add retries in all
subsystems individually.

[^1]: https://docs.kernel.org/userspace-api/netlink/intro.html
[^2]: vishvananda/netlink#1018

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. kind/bug This is a bug in the Cilium logic. release-blocker/1.17 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Agent Unable to write node config header: interrupted system call
3 participants