Description
Hello,
Since updating to the latest Concourse v7.12.0 release, we have been observing occasional task execution failures. We are using the containerd runtime on Linux 5.10 on amd64.
find or create container on worker REDACTED: starting task: network add: cni net setup: plugin type="loopback" failed (add): interrupted system call
The failures also sometimes afflict our resource get steps. (I would assume these are actually more common, though we do not readily observe them.)
I do not know whether the problem can be reproduced on demand. The majority of builds succeed.
I am fairly confident this started with Concourse v7.12.0 but cannot be entirely certain. We had been using v7.11.2 since February 9, and we fire alerts on some build failures. I think I would have noticed this problem had it occurred on v7.11.2.
Our Concourse cluster executes on bare metal. Worker uptime is typically measured in units of weeks or months. This is all to say that there is little external interference.
What I have not yet been able to understand is this: Go's change to asynchronous (signal-based) preemption happened a long time ago, back in Go 1.14, and most system call interfaces were already hardened with EINTR retry loops at the time. It is strange to only see this happen now.
{"timestamp":"2024-11-12T23:18:01.379566817Z","level":"error","source":"worker","message":"worker.garden.garden-server.create.failed","data":{"error":"starting task: network add: cni net setup: plugin type=\"loopback\" failed (add): interrupted system call","request":{"Handle":"59cdd84e-498d-48a9-5806-0b795a9ba406","GraceTime":0,"RootFSPath":"raw:///data/volumes/live/2ba00ba4-57a0-4c88-603e-cca16c983fe4/volume","BindMounts":[{"src_path":"/data/volumes/live/37c88e15-bcd4-4ff3-40eb-a75193fa31f3/volume","dst_path":"/scratch","mode":1}],"Network":"","Privileged":false,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{},"pid_limits":{}}},"session":"1.2.16112"}}
This EINTR may be coming from inside the CNI plugin process itself (as opposed to some syscall executed just prior to invoking the CNI plugin).
Previously, on Concourse v7.11.2:
# /usr/local/concourse/bin/loopback
CNI loopback plugin v1.3.0
CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0
Now, on Concourse v7.12.0:
# /usr/local/concourse/bin/loopback
CNI loopback plugin v1.6.0
CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0, 1.1.0
Note from Concourse maintainers
We should keep track of this issue and the ones related to it: containernetworking/plugins#1121
From my quick reading of the upstream issues from Docker, it sounds like they wrap the netlink package and reproduce its original behaviour from before v1.2.1, where interrupted operations were retried internally. We need to investigate whether we can do the same, or whether this is something containerd or CNI needs to do for us. I am not sure what the effects would be if we try to handle this error ourselves.
Docker fixed this for themselves in this PR: moby/moby#48598
k8s issue tracking this problem as well: kubernetes/kubernetes#129562
Useful comment from that thread: kubernetes/kubernetes#129562 (comment)