-
Notifications
You must be signed in to change notification settings - Fork 787
Preserve results when NLM_F_DUMP_INTR is set #1018
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3b85385
to
fae5970
Compare
@adrianchiris, @aboch ... could you take a look? |
// ErrDumpInterrupted is an instance of errDumpInterrupted, used to report that | ||
// a netlink function has set the NLM_F_DUMP_INTR flag in a response message, | ||
// indicating that the results may be incomplete or inconsistent. | ||
var ErrDumpInterrupted = errDumpInterrupted{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could use:
errors.New("results may be incomplete or inconsistent")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not equivalent: errDumpInterrupted
implements interface { Is(error) bool }
. errors.Is(errors.New(""), unix.EINTR) == false
while errors.Is(ErrDumpInterrupted, unix.EINTR) == true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes you are right.
@robmry just as an FYI, the check was introduced to align with how libmnl is executing netlink commands. see [1] can you provide more info on the use-case where its ok and needed to rely partial results instead of retry ? (except the ci failure u linked). TBH i favor sticking with how iproute2/libmnl implements execution of these cmds unless there is a good reason to deviate |
Thanks @adrianchiris - the moby (dockerd) maintainers were concerned about unbounded retries on the new So, the preference is to be able to retry a limited number of times and, in the unlikely event the retries fail, fall-back to using the partial data - on the basis that it'll be no worse than it was before the change, and no problems have been noticed/reported. As an example of where partial data might be used in a more considered way - in this package, Many callers are still likely to see the error is non-nil and just take an error path without looking at the data. iproute2 does something similar here https://github.com/iproute2/iproute2/blob/e9096586e0701d5ae031df2f2708d20d34ae7bd4/lib/libnetlink.c#L901 ... it doesn't retry on NLM_F_DUMP_INTR, but it's not a daemon and the user can choose to retry when they get an error. |
Yes, the goal of this library is to be in sync with iproute2 behavior as much as possible. Per go guidelines, but in general a good practice for any language, if the the function returns an non nil error, any other returned variable must not be trusted. The fact that you have each list method to still return the original INTR error is good, but if the parsing functions fail because the partial list is truncated at the element level (I do not know if it is possible) then the parsing function will fail and the returned error will be a parsing error. This will mask the real INTR error, making it harder to root cause. |
The There's nothing moby-specific about the change, or requirement ... it's discussed in my comment because I was responding to the request for a use-case, and moby is one. But, the
I can't find anything like that in https://go.dev/doc/effective_go#errors - I also thought it was unusual, but io.ReadAll is a counter-example. In this case, the error is very explicit that the results aren't to be trusted. The idea is to give the caller the choice (like the kernel and iproute2 do).
I don't think that's possible - individual results from a collection are copied into the netlink response with locks held, so they'll be fine. The flag indicates that a generation number for the collection changed as those results were being accumulated. (A reference to a position in the collection is stored between netlink responses to a request, but that position can get out of date. Even with a single netlink response message, collections can be changed between reads of elements of the collection.) |
Before v1.2.1, NLM_F_DUMP_INTR flag was not checked when parsing messages. If it was possible for the kernel to return a truncated list, someone would have seen and complained about parsing errors in the same circumstances where we are currently complaining about EINTR. This library itself is very strong evidence that the kernel does not pass ill-formed messages to userspace when a dump is interrupted. |
I guess you convinced me. |
Hi all, Adding some more points:
That said, i have no objection for the change as long as we still fail if results are inconsistent.
please note that the kernel refers to this data as "inconsistent" rather then partial. [1] https://github.com/torvalds/linux/blob/8d8d276ba2fb5f9ac4984f5c10ae60858090babc/Documentation/userspace-api/netlink/intro.rst#dump-consistency Line 1871 in a018296
|
fae5970
to
9c357a0
Compare
Done, except the error is
|
Yes, indeed. But, if the data is returned, the caller has the option of choosing to do something else.
Also macro rtnl_dump_filter, which is used to process results for a lot of functions that make
Agreed - but this PR already only changes functions that do an
I think that's what I described in the example? ("in this package, LinkByName can fall back to using linkByNameDump, which searches a dump for the name"), and now in the godoc comment describing the |
@robmry This project follows a "One change/fix/feature per pull request" and the "Squash and Merge" option is not enabled. |
9c357a0
to
1f76c75
Compare
Oh, sure - done ... I got rid of the gofmt changes, as they weren't directly related (only really there to stop my editor messing with code in the other commits). The other three commits were split to try to make reviews a bit easier:
(I've still got those changes ... happy to re-raise as separate PRs if preferred?) |
ipset_linux.go
Outdated
@@ -250,7 +250,7 @@ func (h *Handle) IpsetListAll() ([]IPSetResult, error) { | |||
result[i].unserialize(msg) | |||
} | |||
|
|||
return result, nil | |||
return result, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to check for ErrDumpInterrupted at L244?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you - I think I modified this one as it's a ...ListAll()
- then realised it doesn't set NLM_F_DUMP
, and didn't fully undo the change. I'll change the return back to nil
.
@@ -2440,6 +2464,9 @@ func linkSubscribeAt(newNs, curNs netns.NsHandle, ch chan<- LinkUpdate, done <-c | |||
continue | |||
} | |||
for _, m := range msgs { | |||
if m.Header.Flags&unix.NLM_F_DUMP_INTR != 0 && cberr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even if for initial list would return dump interrupted there would be additional notifications from kernel (i think).
not sure we want this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly ... what additional notifications are you thinking of, and where would they go?
@@ -157,6 +157,9 @@ func (u *UnixSocket) deserialize(b []byte) error { | |||
} | |||
|
|||
// SocketGet returns the Socket identified by its local and remote addresses. | |||
// | |||
// If the returned error is [ErrDumpInterrupted], the search for a result may | |||
// be incomplete and the caller should retry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ATM this would return nil as Socket if any error happened at L219 (req.Execute()) is this intended ?
i.e in case of ErrDumpInterrupted we dont get the partial results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one returns specific errors if there's not exactly one result and, even if there is one socket, either of those is a possibility if the dump was interrupted.
So, if there's not exactly one result, ErrDumpInterrupted
should be returned rather than one of those other errors to indicate that a retry is needed.
But, I guess it'd be better to explicitly check for that case ... I'll update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one returns specific errors if there's not exactly one result ...
Hmm - looking at it more closely, that might have been what was intended, but it's not what it does ...
if len(msgs) == 0 {
return nil, errors.New("no message nor error from netlink")
}
if len(msgs) > 2 {
return nil, fmt.Errorf("multiple (%d) matching sockets", len(msgs))
}
So, if there are 2 results, it'll use the first and discard the second. I guess the second test should probably be > 1
or >= 2
.
Assuming that's a bug, fixing it doesn't belong in this PR - so I'll leave it alone here ... just returning nil, rather than making things more confusing.
Add a specific error to report that a netlink response had NLM_F_DUMP_INTR set, indicating that the set of results may be incomplete or inconsistent. unix.EINTR was previously returned (with no results) when the NLM_F_DUMP_INTR flag was set. Now, errors.Is(err, unix.EINTR) will still work. But, this will be a breaking change for any code that's checking for equality with unix.EINTR. Return results with ErrDumpInterrupted. Results may be incomplete or inconsistent, but give the caller the option of using them. Look for NLM_F_DUMP_INTR in more places: - linkSubscribeAt, neighSubscribeAt, routeSubscribeAt - can do an initial dump, which may report inconsistent results -> if there's an error callback, call it with ErrDumpInterrupted - socketDiagXDPExecutor - makes an NLM_F_DUMP request, without using Execute() -> give it the same behaviour as functions that do use Execute() Signed-off-by: Rob Murray <rob.murray@docker.com>
1f76c75
to
5da8257
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
…93d350 - Preserve results when NLM_F_DUMP_INTR is set vishvananda/netlink#1018 - Fix SetSendTimeout/SetReceiveTimeout vishvananda/netlink#1012 full diff: vishvananda/netlink@v1.3.0...084abd9 Signed-off-by: Rob Murray <rob.murray@docker.com>
@adrianchiris @aboch any chance on tagging a new release now that our fixes went in? |
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. cilium#32099), the logged error message can still cause CI to fail (e.g. cilium#35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 5d1951b ] According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 5d1951b ] According to the kernel docs[^1], the kernel can return incomplete results for netlink state dumps if the state changes while we are dumping it. The result is then marked by `NLM_F_DUMP_INTR`. The `vishvananda/netlink` library returned `EINTR` since v1.2.1, but more recent versions have changed it such that it returns `netlink.ErrDumpInterrupted` instead[^2]. These interruptions seem common in high-churn environments. If the error occurs, it is in most cases best to just try again. Therefore, this commit adds a wrapper for all `netlink` functions marked to return `ErrDumpInterrupted` that retries the function up to 30 times until it either succeeds or returns a different error. While may call sites do have their own high-level retry mechanism (see e.g. #32099), the logged error message can still cause CI to fail (e.g. #35259). Long high-level retry intervals can also become problematic: For example, if the routing setup fails due to `NLM_F_DUMP_INTR` during an CNI ADD invocation, the retry adds add several seconds of additional delay to an already overloaded system, instead of resolving the issue quickly. A subsequent commit will add an additional linter that nudges developers to use this new `safenetlink` package for function calls that can be interrupted. This ensures that we don't have to add retries in all subsystems individually. [^1]: https://docs.kernel.org/userspace-api/netlink/intro.html [^2]: vishvananda/netlink#1018 Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Should we have tag a new release with this change, it sounds like a very important change |
Similar to vishvananda#1018, but for ConntrackDeleteFilters() Relates to kubernetes/kubernetes#129562
Similar to vishvananda#1018, but for ConntrackDeleteFilters() Relates to kubernetes/kubernetes#129562
Similar to vishvananda#1018, but for ConntrackDeleteFilters() Relates to kubernetes/kubernetes#129562
Similar to #1018, but for ConntrackDeleteFilters() Relates to kubernetes/kubernetes#129562
Hunch is that the change in vishvananda/netlink#1018, will prevent EBUSY by draining the remaining messages from the kernel before returning. Before vishvananda/netlink#925, all messages would be read but the "interrupted" flag was ignored. That PR made it so that the dump would return early, but that would leave some messages unconsumed. There was logic elsewhere to discard these messages, but the kernel has a check for "dump in progress" at the start of a new dump and it returns EBUSY in that case: ``` int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb, const struct nlmsghdr *nlh, struct netlink_dump_control *control) { ... mutex_lock(&nlk->nl_cb_mutex); /* A dump is in progress... */ if (nlk->cb_running) { ret = -EBUSY; goto error_unlock; } ``` I wasn't able to trace the exact path through the kernel code to verify when cb_running would be set, or that the error would be returned on the NLMSG_DONE as we see here but it seemed like a solid bet.
@aboch @vishvananda we need a new release which includes this change to unblock aws/amazon-vpc-cni-k8s#3196 |
1. fix rule test failed when rule add slow. disable broadcast if broadcast is set to net.IPv4zero remove comments about broadcast when deleting address remove another comment about broadcast auto calculation .github/workflows: Bump CI Go version to v1.22 Update the Go version we test against to Go v1.22 which is currently the oldest version still receiving security updates. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com> 1. filter match support vlanId and srcMac, dstMac. 2. filter action support vlan pop/push. link_linux: Add deserialization of `IFF_RUNNING` flag Add deserialization of the `IFF_RUNNING` link flag which translates to `net.FlagRunning`. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com> Preserve results when NLM_F_DUMP_INTR is set Similar to vishvananda#1018, but for ConntrackDeleteFilters() Relates to kubernetes/kubernetes#129562 Add IFLA_PARENT_DEV_NAME / IFLA_PARENT_DEV_BUS_NAME to links These attributes are supported since kernel v5.14 (see [1]). Here's what iproute2 shows: ``` $ ip -d link show eth0 4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65535 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 ... parentbus virtio parentdev virtio0 ``` [1]: torvalds/linux@00e77ed Signed-off-by: Albin Kerouanton <albinker@gmail.com> conntrack: prevent potential memory leak Currently, the ConntrackDeleteFilters captures all flow entries it fails to delete and reports them as errors. This behavior can potentially lead to memory leaks in high-traffic systems, where thousands of conntrack flow entries are cleared in a single batch. With this commit, instead of returning all the un-deleted flow entries, we now return a single error message for all of them. Signed-off-by: Daman Arora <aroradaman@gmail.com> Fix parsing 4-bytes attribute What if the data length of attribute is 4? The attribute will be ignored, because `i+4 < len(data)`. Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> fix: Use correct offset for unix socket diagnosis Signed-off-by: Sven Rebhan <srebhan@influxdata.com> vxlan: Fix parseVxlanData for source port range binary.Read() != nil check means error case, so the vxlan.Port{Low,High} are never populated. Fix the check. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> netkit: Allow setting MAC address in L2 mode Signed-off-by: Jordan Rife <jrife@google.com> Add support for MTU Lock When adding a route with "mtu lock <mtu>" path MTU discovery (PMTUD) will not be tried and packets will be sent without DF bit set. Upon receiving an ICMP needs frag due to PMTUD, the kernel will not install a cached route and lower the MTU. Signed-off-by: Tim Rozet <trozet@redhat.com> pedit: Fix EncodeActions to add TcGen for pedit action TcGen was missing in pedit action and the kernel cannont correctly process pedit action. Signed-off-by: Chen Tang <tangchen.1@bytedance.com> go.mod: github.com/vishvananda/netns v0.0.5 - Adding file path for nerdctl and finch full diff: vishvananda/netns@v0.0.4...v0.0.5 Signed-off-by: Sebastiaan van Stijn <github@gone.nl> Add `OifIndex` option for `RouteGetWithOptions` The `RouteGetWithOptions` function currently has a `Oif` option which gets translated from link name to link index via a `LinkByName` call. This adds unnecessary overhead when the link index is already known. This commit adds a new `OifIndex` option to `RouteGetWithOptions` which can be specified instead of `Oif` to skip the internal link index translation. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com> Support "sample" filter action This change adds support for packet sampling using "psample" kernel module. Added PCPU and SA fields to XfrmState Add support for ARP/ND Timestamps when retriving neighbors On Linux, Netlink provides NDA_CACHEINFO which carries timestamps about when ARP/ND was updated, used, and confirmed. Expose these fields in the Neigh type tuntap: parse additional netlink attributes for flags and queues Signed-off-by: Ivan Tsvetkov <ivanfromearth@gmail.com> tuntap: add support for dynamically managing multi-queue FDs Introduce AddQueues and RemoveQueues methods for attaching and detaching queue file descriptors to an existing TUN/TAP interface in multi-queue mode. This enables controlled testing of disabled queues and fine-grained queue management without relying on interface recreation. Signed-off-by: Ivan Tsvetkov <ivanfromearth@gmail.com> add SRv6 support for END.DT4 fix: add missing CLOEXEC flag Some calls were already using it, some were not, but fix the remaining ones. Without this flag, the file descriptor would to the child process after fork/exec. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com> tests: Improve address unit test infrastructure Signed-off-by: kayos@tcp.direct <kayos@tcp.direct> addr_linux: don't require label to be prefixed with interface name This requirement limits the usefulness of labels (given the total label length can only be 15 characters). Signed-off-by: Julian Wiedmann <jwi@isovalent.com> feat: add IFLA_INET6_ADDR_GEN_MODE support geneve: Support setting/getting source port range Add support for geneve feature to specify source port range, see kernel commits: - e1f95b1992b8 ("geneve: Allow users to specify source port range") - 5a41a00cd5d5 ("geneve, specs: Add port range to rt_link specification") This is exactly equivalent on what is done in case of vxlan today. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> feat: add support for RtoMin lock veth: allow configuring peer attributes beyond namespace and address Signed-off-by: Gwendolyn <me@gwendolyn.dev> qdisc: fix wrong type info of tc_sfq_qopt Mimic `ipset` C code for determining correct default ipset revision Signed-off-by: Benjamin Leggett <benjamin.leggett@solo.io> bugfix: parse ipv4 src/dst error rdma: support rdma metrics: resource and statistic Signed-off-by: bingshen.wbs <bingshen.wbs@alibaba-inc.com> feat: add vlanid - tunnelid mapping support filter: add classid and port range support for flower vlan: add support for flags and qos maps Signed-off-by: Gwendolyn <me@gwendolyn.dev> Add support for the `expires` option of `ip route`
1. fix rule test failed when rule add slow. disable broadcast if broadcast is set to net.IPv4zero remove comments about broadcast when deleting address remove another comment about broadcast auto calculation .github/workflows: Bump CI Go version to v1.22 Update the Go version we test against to Go v1.22 which is currently the oldest version still receiving security updates. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com> 1. filter match support vlanId and srcMac, dstMac. 2. filter action support vlan pop/push. link_linux: Add deserialization of `IFF_RUNNING` flag Add deserialization of the `IFF_RUNNING` link flag which translates to `net.FlagRunning`. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com> Preserve results when NLM_F_DUMP_INTR is set Similar to vishvananda#1018, but for ConntrackDeleteFilters() Relates to kubernetes/kubernetes#129562 Add IFLA_PARENT_DEV_NAME / IFLA_PARENT_DEV_BUS_NAME to links These attributes are supported since kernel v5.14 (see [1]). Here's what iproute2 shows: ``` $ ip -d link show eth0 4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65535 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 ... parentbus virtio parentdev virtio0 ``` [1]: torvalds/linux@00e77ed Signed-off-by: Albin Kerouanton <albinker@gmail.com> conntrack: prevent potential memory leak Currently, the ConntrackDeleteFilters captures all flow entries it fails to delete and reports them as errors. This behavior can potentially lead to memory leaks in high-traffic systems, where thousands of conntrack flow entries are cleared in a single batch. With this commit, instead of returning all the un-deleted flow entries, we now return a single error message for all of them. Signed-off-by: Daman Arora <aroradaman@gmail.com> Fix parsing 4-bytes attribute What if the data length of attribute is 4? The attribute will be ignored, because `i+4 < len(data)`. Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> fix: Use correct offset for unix socket diagnosis Signed-off-by: Sven Rebhan <srebhan@influxdata.com> vxlan: Fix parseVxlanData for source port range binary.Read() != nil check means error case, so the vxlan.Port{Low,High} are never populated. Fix the check. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> netkit: Allow setting MAC address in L2 mode Signed-off-by: Jordan Rife <jrife@google.com> Add support for MTU Lock When adding a route with "mtu lock <mtu>" path MTU discovery (PMTUD) will not be tried and packets will be sent without DF bit set. Upon receiving an ICMP needs frag due to PMTUD, the kernel will not install a cached route and lower the MTU. Signed-off-by: Tim Rozet <trozet@redhat.com> pedit: Fix EncodeActions to add TcGen for pedit action TcGen was missing in pedit action and the kernel cannont correctly process pedit action. Signed-off-by: Chen Tang <tangchen.1@bytedance.com> go.mod: github.com/vishvananda/netns v0.0.5 - Adding file path for nerdctl and finch full diff: vishvananda/netns@v0.0.4...v0.0.5 Signed-off-by: Sebastiaan van Stijn <github@gone.nl> Add `OifIndex` option for `RouteGetWithOptions` The `RouteGetWithOptions` function currently has a `Oif` option which gets translated from link name to link index via a `LinkByName` call. This adds unnecessary overhead when the link index is already known. This commit adds a new `OifIndex` option to `RouteGetWithOptions` which can be specified instead of `Oif` to skip the internal link index translation. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com> Support "sample" filter action This change adds support for packet sampling using "psample" kernel module. Added PCPU and SA fields to XfrmState Add support for ARP/ND Timestamps when retriving neighbors On Linux, Netlink provides NDA_CACHEINFO which carries timestamps about when ARP/ND was updated, used, and confirmed. Expose these fields in the Neigh type tuntap: parse additional netlink attributes for flags and queues Signed-off-by: Ivan Tsvetkov <ivanfromearth@gmail.com> tuntap: add support for dynamically managing multi-queue FDs Introduce AddQueues and RemoveQueues methods for attaching and detaching queue file descriptors to an existing TUN/TAP interface in multi-queue mode. This enables controlled testing of disabled queues and fine-grained queue management without relying on interface recreation. Signed-off-by: Ivan Tsvetkov <ivanfromearth@gmail.com> add SRv6 support for END.DT4 fix: add missing CLOEXEC flag Some calls were already using it, some were not, but fix the remaining ones. Without this flag, the file descriptor would to the child process after fork/exec. Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com> tests: Improve address unit test infrastructure Signed-off-by: kayos@tcp.direct <kayos@tcp.direct> addr_linux: don't require label to be prefixed with interface name This requirement limits the usefulness of labels (given the total label length can only be 15 characters). Signed-off-by: Julian Wiedmann <jwi@isovalent.com> feat: add IFLA_INET6_ADDR_GEN_MODE support geneve: Support setting/getting source port range Add support for geneve feature to specify source port range, see kernel commits: - e1f95b1992b8 ("geneve: Allow users to specify source port range") - 5a41a00cd5d5 ("geneve, specs: Add port range to rt_link specification") This is exactly equivalent on what is done in case of vxlan today. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> feat: add support for RtoMin lock veth: allow configuring peer attributes beyond namespace and address Signed-off-by: Gwendolyn <me@gwendolyn.dev> qdisc: fix wrong type info of tc_sfq_qopt Mimic `ipset` C code for determining correct default ipset revision Signed-off-by: Benjamin Leggett <benjamin.leggett@solo.io> bugfix: parse ipv4 src/dst error rdma: support rdma metrics: resource and statistic Signed-off-by: bingshen.wbs <bingshen.wbs@alibaba-inc.com> feat: add vlanid - tunnelid mapping support filter: add classid and port range support for flower vlan: add support for flags and qos maps Signed-off-by: Gwendolyn <me@gwendolyn.dev> Add support for the `expires` option of `ip route`
When the kernel detects a change in a set of resources being dumped (walked-over) by a netlink call, it sets flag
NLM_F_DUMP_INTR
to indicate that the results may be incomplete or inconsistent.Before #925 (in v1.2.1), the flag was ignored and results were returned without an error. With that change, response handling is aborted, results are discarded, and
unix.EINTR
is returned.Depending on what the caller is looking for, the incomplete/inconsistent results may be usable - and that may be preferable to retrying the whole request (which could take unbounded time on a system that's seeing a lot of changes in a lot of data).
This PR introduces a new error,
netlink.ErrDumpInterrupted
that's specific to theNLM_F_DUMP_INTR
case. Response handling completes as normal, and data is returned along with the error. So, the caller can use something likeerrors.Is(err, netlink.ErrDumpInterrupted)
, and take appropriate action.If any callers are checking for
errors.Is(err, unix.EINTR)
, that'll continue to work. But, this is a breaking change for any callers checking the error with a simpleerr == unix.EINTR
.