Closed
istio/test-infra
#3341
Description
Running summary:
- Last prow update was 5/17 (istio/test-infra@07e88ae). No issues after this until 5/19.
- On 5/19 we updated to kind v0.11.0
- On the night of 5/19 we started seeing elevated test failures. These were not isolated to a single failure or job; things were just generally not working. Many of them were related to timeouts in api-server, etc.
- Assuming it was a regression in kind, we reverted kind v0.11.0 back to v0.10.0. The errors persisted.
- After analyzing the failures and the cluster, nothing stood out. Failures happened on ~all nodes, node resource utilization looked normal, there were no errors, etc.
- We updated docker to 20.10.6, no improvement
- We updated the build cluster's k8s version. This was mostly to bounce all of the nodes. No improvement.
- We reverted to the exact docker image from before kind v0.11.0 (rather than reverting the change in the Dockerfile and rebuilding). The same issues still occurred.
Aside from the kind v0.11.0 update, there were no known changes to any of our infrastructure.
The failures seem to have one thing in common: containerd.
- We have seen buildkit fail to connect to the local containerd socket (I think, anyhow) with "grpc: the client connection is closing" and "no active session for ssyocosy7viqzb53vwjkxf322: context deadline exceeded"
- We have seen docker pushes to our local image registry fail (logs)
- We have seen containerd (in kind) fail to start the etcd container (logs)
- We have seen containerd (in kind) start the etcd container, but only after health check timeouts are hit (logs)
- We have seen the kind cluster docker container fail to start with port conflicts (logs)
I have also checked our internal prow instance (which is a completely distinct build cluster, etc) and we do see the "grpc: the client connection is closing" errors there as well; we don't run kind at all for those.
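To gauge how widely the two buildkit error strings quoted above show up, job logs could be tallied with something like the following. This is only a rough sketch: the signature labels, the `count_signatures` helper, and the assumption that logs are available as local text files are all hypothetical and not part of our tooling.

```python
import re
import sys
from collections import Counter

# Error strings quoted in this issue. The session ID after "no active session for"
# varies per build, so it is matched loosely. The labels are made up for this sketch.
SIGNATURES = {
    "client connection closing": re.compile(r"grpc: the client connection is closing"),
    "no active session": re.compile(
        r"no active session for \w+: context deadline exceeded"
    ),
}


def count_signatures(lines):
    """Count how many lines match each known error signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: pass local log file paths as arguments.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as f:
            print(path, dict(count_signatures(f)))
```

Running this over logs from both the kind-based jobs and the internal prow instance would show whether the same signatures appear in both places, independent of kind.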