Description
Hello,
Since updating to the latest Concourse v7.12.0 release, we have been observing occasional task execution failures. We are using the containerd runtime on Linux 5.10 on amd64.
find or create container on worker REDACTED: starting task: network add: cni net setup: plugin type="loopback" failed (add): interrupted system call
The failures also sometimes afflict our resource get steps. (I would assume these are actually more common, though we do not readily observe them.)
I do not know whether the problem can be reproduced on demand. The majority of builds succeed.
I am fairly confident this started with Concourse v7.12.0 but cannot be entirely certain. We had been using v7.11.2 since February 9, and we fire alerts on some build failures. I think I would have noticed this problem had it occurred on v7.11.2.
Our Concourse cluster executes on bare metal. Worker uptime is typically measured in units of weeks or months. This is all to say that there is little external interference.
What I have not yet been able to understand is this: Go's change to asynchronous (signal-based) preemption happened a long time ago, back in Go 1.14, and most system call interfaces were already hardened with EINTR retry loops at the time. It is strange to only see this happen now.
{"timestamp":"2024-11-12T23:18:01.379566817Z","level":"error","source":"worker","message":"worker.garden.garden-server.create.failed","data":{"error":"starting task: network add: cni net setup: plugin type=\"loopback\" failed (add): interrupted system call","request":{"Handle":"59cdd84e-498d-48a9-5806-0b795a9ba406","GraceTime":0,"RootFSPath":"raw:///data/volumes/live/2ba00ba4-57a0-4c88-603e-cca16c983fe4/volume","BindMounts":[{"src_path":"/data/volumes/live/37c88e15-bcd4-4ff3-40eb-a75193fa31f3/volume","dst_path":"/scratch","mode":1}],"Network":"","Privileged":false,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{},"pid_limits":{}}},"session":"1.2.16112"}}
This EINTR may be coming from inside the CNI plugin process itself (as opposed to some syscall executed just prior to invoking the CNI plugin).
Previously, on Concourse v7.11.2:
# /usr/local/concourse/bin/loopback
CNI loopback plugin v1.3.0
CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0
Now, on Concourse v7.12.0:
# /usr/local/concourse/bin/loopback
CNI loopback plugin v1.6.0
CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0, 1.1.0
Note from Concourse maintainers
We should keep track of this issue and the ones related to it: containernetworking/plugins#1121
From my quick reading of the upstream issues from Docker, it sounds like they wrap the netlink package and reproduce its original behaviour from before v1.2.1, where interrupted operations were retried internally. We need to investigate whether we can do the same, or whether this is something containerd or CNI needs to do for us. I am not sure what the effects would be if we try to handle this error ourselves.
Docker fixed this for themselves in this PR: moby/moby#48598
k8s issue tracking this problem as well: kubernetes/kubernetes#129562
Useful comment from that thread: kubernetes/kubernetes#129562 (comment)