Test stability regression #32985

@howardjohn

Running summary:

  • Last prow update was 5/17 (istio/test-infra@07e88ae). No issues after this until 5/19.
  • On 5/19 we updated to kind v0.11.0.
  • Around the night of 5/19 we started seeing elevated test failures. These were not isolated to a single failure mode or job; things were failing across the board. Many of them were related to timeouts in api-server, etc.
  • Assuming it was a regression in kind, we reverted kind v0.11.0 back to v0.10.0. The errors persisted.
  • After analyzing the failures and the cluster, nothing stood out. Failures happened on ~all nodes, node resource utilization looked normal, there were no errors, etc.
  • We updated Docker to 20.10.6. No improvement.
  • We updated the build cluster's k8s version. This was mostly to bounce all of the nodes. No improvement.
  • We reverted to the exact Docker image used prior to kind v0.11.0 (rather than reverting the change in the Dockerfile and rebuilding). The same issues still occurred.

Aside from the kind v0.11.0 update, there were no known changes to any of our infrastructure.

The failures seem to have one thing in common: containerd.

I have also checked our internal prow instance (which is a completely distinct build cluster, etc.), and we see the "grpc: the client connection is closing" errors there as well, even though we don't run kind at all for those jobs.
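
To compare how often these signatures show up across jobs or clusters, something like the following could be run over downloaded build logs. This is only a rough sketch: the grpc string is the one quoted above, while the "containerd" keyword match, the function name `tally_signatures`, and the idea of passing log file paths on the command line are assumptions for illustration, not part of our tooling.

```python
import sys
from collections import Counter
from typing import Iterable

# The grpc string is quoted from this issue. Matching on the bare word
# "containerd" is an assumption about what related log lines contain.
SIGNATURES = {
    "grpc_conn_closing": "grpc: the client connection is closing",
    "containerd": "containerd",
}


def tally_signatures(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each known failure signature."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for name, needle in SIGNATURES.items():
            if needle in lowered:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python tally.py build-log.txt [more logs...]
    total = Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as f:
            total += tally_signatures(f)
    for name, n in total.most_common():
        print(f"{name}: {n}")
```

Comparing the counts between kind and non-kind jobs would give a quick sanity check on whether the errors really are independent of kind.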
