Test stability regression #32985

Running summary:
* Last prow update was 5/17 (https://github.com/istio/test-infra/commit/07e88ae5d09da7257ac863e3c5c052bbee754a84). No issues after this until 5/19.
* On 5/19 we updated to kind v0.11.0.
* On the night of 5/19 we started seeing elevated test failures. These were not isolated to a single failure or job; things were generally not working. Many of them were related to timeouts in the api-server, etc.
* Assuming it was a regression in kind, we reverted kind v0.11.0 back to v0.10.0. The errors persisted.
* After analyzing the failures and the cluster, nothing stood out. Failures happened on ~all nodes, node resource utilization looked normal, there were no errors, etc.
* We updated docker to 20.10.6. No improvement.
* We updated the build cluster's k8s version. This was mostly to bounce all of the nodes. No improvement.
* We reverted to the exact docker image from before kind v0.11.0 (rather than reverting the change in the Dockerfile and rebuilding). The same issues occurred.

Aside from the kind v0.11.0 update, there were no known changes to any of our infrastructure.

The failures seem to have one thing in common: containerd.

* We have seen buildkit fail to connect to the local containerd socket (I think, anyhow) with [`grpc: the client connection is closing`](https://prow.istio.io/view/gs/istio-prow/pr-logs/pull/istio_istio/33009/integ-distroless-k8s-tests_istio/1395738219563192320) and [`no active session for ssyocosy7viqzb53vwjkxf322: context deadline exceeded`](https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/32990/integ-telemetry-mc-k8s-tests_istio/1395435822206947328/build-log.txt)
* We have seen docker pushes to our local image registry fail [logs](https://prow.istio.io/view/gs/istio-prow/pr-logs/pull/istio_istio/32996/integ-pilot-multicluster-tests_istio/1395738304959221760)
* We have seen containerd (in kind) fail to start the etcd container [logs](https://storage.googleapis.com/istio-prow/pr-logs/pull/istio_istio/32952/integ-pilot-multicluster-tests_istio/1395732447588519936/artifacts/kind/remote-control-plane/containerd.log)
* We have seen containerd (in kind) start the etcd container, but only after health check timeouts are hit [logs](https://prow.istio.io/view/gs/istio-prow/pr-logs/pull/istio_istio/32998/integ-security-multicluster-tests_istio/1395480913936125952)
* We have seen the kind cluster docker container fail to start with port conflicts [logs](https://prow.istio.io/view/gs/istio-prow/logs/integ-ipv6-k8s-tests_istio_postsubmit/1395443736476913664)


I have also checked our internal prow instance (which is a completely distinct build cluster, etc.) and we do see the `grpc: the client connection is closing` errors there as well; we don't run kind at all for those.
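
To compare how often these errors show up across jobs or across the two prow instances, something like the following could be run over downloaded build logs. This is only a rough sketch: the function name `classify_log`, the signature labels, and the regex used for the session ID are assumptions for illustration, and only the two error strings quoted above are matched.

```python
import re
from collections import Counter

# Error strings quoted in this issue. The labels and the \w+ pattern for the
# session ID are illustrative assumptions, not an established log format.
SIGNATURES = {
    "grpc_closing": re.compile(r"grpc: the client connection is closing"),
    "no_active_session": re.compile(
        r"no active session for \w+: context deadline exceeded"
    ),
}


def classify_log(text):
    """Count how many lines of a build log match each known signature."""
    counts = Counter()
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Summing the returned counters per job or per cluster would show whether a signature is concentrated in kind-based jobs or spread evenly, which matters given that the internal instance does not run kind.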

