-
Notifications
You must be signed in to change notification settings - Fork 3.4k
datapath: Retry netlink.LinkList when interrupted #35259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
jrajahalme
merged 1 commit into
cilium:main
from
jrajahalme:datapath-retry-linklist-on-interrupted-system-call
Oct 20, 2024
Merged
datapath: Retry netlink.LinkList when interrupted #35259
jrajahalme
merged 1 commit into
cilium:main
from
jrajahalme:datapath-retry-linklist-on-interrupted-system-call
Oct 20, 2024
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Retry netlink.LinkList call if it returns EINTR error. This should fix this failure mode seen frequently in the CI: time="2024-09-26T13:15:45.933969654Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: rendering vlan filter macros: listing network interfaces: interrupted system call" subsys=datapath-loader ... time="2024-09-26T13:15:46.100906287Z" level=error msg="Failed to compile bpf_lxc.o: exit status 1" compiler-pid=793 subsys=datapath-loader time="2024-09-26T13:15:46.100944558Z" level=warning msg="In file included from /var/lib/cilium/bpf/bpf_lxc.c:18:" subsys=datapath-loader time="2024-09-26T13:15:46.100961449Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/auth.h:6:" subsys=datapath-loader time="2024-09-26T13:15:46.100969916Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/common.h:1279:" subsys=datapath-loader time="2024-09-26T13:15:46.100978361Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/overloadable.h:10:" subsys=datapath-loader time="2024-09-26T13:15:46.100987608Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/overloadable_skb.h:8:" subsys=datapath-loader time="2024-09-26T13:15:46.100997066Z" level=warning msg="/var/lib/cilium/bpf/lib/clustermesh.h:18:23: error: use of undeclared identifier '__config_identity_length'" subsys=datapath-loader time="2024-09-26T13:15:46.101011563Z" level=warning msg=" 18 | __u32 identity_len = CONFIG(identity_length);" subsys=datapath-loader time="2024-09-26T13:15:46.101018676Z" level=warning msg=" | ^" subsys=datapath-loader time="2024-09-26T13:15:46.101138079Z" level=warning msg="/var/lib/cilium/bpf/lib/static_data.h:75:10: note: expanded from macro 'CONFIG'" subsys=datapath-loader Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
/test |
rgo3
reviewed
Oct 7, 2024
Maybe listing network interfaces is kind of susceptible for |
@ti-mo Could you consider accepting this or proposing another fix soonish? I keep hitting this in my PRs pretty consistently (e.g. https://github.com/cilium/cilium/actions/runs/11418520360/attempts/2) |
ti-mo
approved these changes
Oct 20, 2024
gandro
added a commit
to gandro/cilium
that referenced
this pull request
Oct 29, 2024
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro
added a commit
to gandro/cilium
that referenced
this pull request
Oct 29, 2024
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro
added a commit
to gandro/cilium
that referenced
this pull request
Oct 30, 2024
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro
added a commit
to gandro/cilium
that referenced
this pull request
Oct 30, 2024
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro
added a commit
to gandro/cilium
that referenced
this pull request
Oct 30, 2024
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
github-merge-queue bot
pushed a commit
that referenced
this pull request
Oct 30, 2024
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
gandro
added a commit
that referenced
this pull request
Oct 30, 2024
[ upstream commit 5d1951b ] According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
github-merge-queue bot
pushed a commit
that referenced
this pull request
Oct 31, 2024
[ upstream commit 5d1951b ] According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/datapath
Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
kind/bug
This is a bug in the Cilium logic.
release-blocker/1.17
This issue will prevent the release of the next version of Cilium.
release-note/bug
This PR fixes an issue in a previous release of Cilium.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Retry netlink.LinkList call if it returns EINTR error, due to an upstream netlink library change to fail if
NLM_F_DUMP_INTR
flag is set in the netlink response (vishvananda/netlink#925).After we get our netlink dependency updated to include vishvananda/netlink#1018, we can change this retry condition to look for
netlink.ErrDumpInterrupted
instead.This fixes this failure mode seen frequently in the CI recently:
time="2024-09-26T13:15:45.933969654Z" level=error msg="Unable to write node config header" error="failed to write node configuration file at /var/run/cilium/state/globals/node_config.h: rendering vlan filter macros: listing network interfaces: interrupted system call" subsys=datapath-loader ...
time="2024-09-26T13:15:46.100906287Z" level=error msg="Failed to compile bpf_lxc.o: exit status 1" compiler-pid=793 subsys=datapath-loader time="2024-09-26T13:15:46.100944558Z" level=warning msg="In file included from /var/lib/cilium/bpf/bpf_lxc.c:18:" subsys=datapath-loader time="2024-09-26T13:15:46.100961449Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/auth.h:6:" subsys=datapath-loader time="2024-09-26T13:15:46.100969916Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/common.h:1279:" subsys=datapath-loader time="2024-09-26T13:15:46.100978361Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/overloadable.h:10:" subsys=datapath-loader time="2024-09-26T13:15:46.100987608Z" level=warning msg="In file included from /var/lib/cilium/bpf/lib/overloadable_skb.h:8:" subsys=datapath-loader time="2024-09-26T13:15:46.100997066Z" level=warning msg="/var/lib/cilium/bpf/lib/clustermesh.h:18:23: error: use of undeclared identifier '__config_identity_length'" subsys=datapath-loader
time="2024-09-26T13:15:46.101011563Z" level=warning msg=" 18 | __u32 identity_len = CONFIG(identity_length);" subsys=datapath-loader
time="2024-09-26T13:15:46.101018676Z" level=warning msg=" | ^" subsys=datapath-loader
time="2024-09-26T13:15:46.101138079Z" level=warning msg="/var/lib/cilium/bpf/lib/static_data.h:75:10: note: expanded from macro 'CONFIG'" subsys=datapath-loader
Fixes: #35055