Volume GC Gets Wedged  #1960

@topherbullock

On wings.concourse.ci we had an incident where the web instances stopped responding.

  • In the logs, the ATCs were erroring with 'too many open files' while trying to listen for HTTP requests.
  • Running lsof showed a lot of open connections to workers on port 7788 (Baggageclaim).
  • The metrics showed a spike in volumes and containers before everything went dark.
  • We restarted the ATCs with the --noop flag (no scheduling), and they seemed to start GC-ing containers normally, but then they got to volumes and stopped.
  • After repeated restarts of the ATCs, we observed that they would start the volume collector, destroy (or attempt to destroy) a handful of orphaned volumes, and then stop entirely.
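For anyone trying to confirm the same symptom, here is a minimal sketch of how the lsof observation above could be quantified. It assumes the text comes from `lsof -nP -iTCP` and counts connections whose remote port is 7788, grouped by command and PID. The function name `count_connections` and the sample rows (process name, user, addresses) are made up for illustration and are not taken from the incident.

```python
from collections import Counter

BAGGAGECLAIM_PORT = 7788  # port named in the report above


def count_connections(lsof_output, port=BAGGAGECLAIM_PORT):
    """Count TCP connections to `port` per (command, pid).

    Expects text in the layout printed by `lsof -nP -iTCP`, where
    connection rows have a NAME field of the form local->remote.
    """
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Header and LISTEN rows have no "->" field and are skipped.
        name = next((f for f in fields if "->" in f), None)
        if name is None or len(fields) < 2:
            continue
        remote = name.split("->", 1)[1]
        remote_port = remote.rsplit(":", 1)[-1]
        if remote_port == str(port):
            counts[(fields[0], fields[1])] += 1
    return counts


# Hypothetical sample rows, for illustration only.
SAMPLE = """\
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
atc     1234 vcap   12u  IPv4 100001      0t0  TCP 10.0.0.5:40001->10.0.1.10:7788 (ESTABLISHED)
atc     1234 vcap   13u  IPv4 100002      0t0  TCP 10.0.0.5:40002->10.0.1.11:7788 (ESTABLISHED)
atc     1234 vcap   14u  IPv4 100003      0t0  TCP 10.0.0.5:40003->10.0.1.10:7777 (ESTABLISHED)
atc     1234 vcap    7u  IPv4 100004      0t0  TCP *:8080 (LISTEN)
"""

result = count_connections(SAMPLE)
```

A count per process that keeps growing across samples would be consistent with connections to Baggageclaim not being released, though this sketch only measures the symptom and says nothing about the cause.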

To fix this we had to recreate all the workers (to clean up all the leftover volumes and containers) and then recreate the ATCs to re-enable scheduling.

This issue led to #1959

  • Concourse version: 3.8.0-rc.2
  • Deployment type: BOSH
  • Infrastructure/IaaS: GCP
  • Did this used to work? 🤷‍♂️
