Closed
Description
On wings.concourse.ci we had an incident where the web instances stopped responding.
- In the logs, the ATCs were erroring with 'too many open files' while trying to listen for HTTP requests.
- Running `lsof` showed a lot of open connections to workers on port 7788 (Baggageclaim).
- Observing the metrics, there was a spike in volumes and containers before everything went dark.
- We restarted the ATCs with the `--noop` flag (no scheduling), and they seemed to start GC-ing containers normally, but then they got to volumes and stopped.
- Continually restarting the ATCs, we observed that they would start the volume collector, destroy (or attempt to destroy) a handful of orphaned volumes, and then stop entirely.
To fix this we had to recreate all the workers (to clean up all the volumes and containers left behind) and then recreate the ATCs to re-enable scheduling.
This issue led to #1959
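
For anyone who hits similar symptoms, the `lsof` check described above could be made repeatable with something like the sketch below. It is not part of Concourse; the function name `count_remote_ports` and the script name are made up for illustration, and it assumes `lsof` was run with numeric output (`-nP`) so that the NAME column looks like `local:port->remote:port`.

```python
import re
import sys
from collections import Counter

# Matches the remote end of a TCP connection in lsof's NAME column.
# Assumes numeric addresses and ports (lsof -nP); listening sockets
# have no "->" and are skipped.
REMOTE_PORT = re.compile(r"->\S*:(\d+)(?:\s|$)")


def count_remote_ports(lines):
    """Count connections per remote port in an iterable of lsof output lines."""
    counts = Counter()
    for line in lines:
        match = REMOTE_PORT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_ports.py lsof.txt
    with open(sys.argv[1]) as f:
        for port, n in count_remote_ports(f).most_common(10):
            print(f"{port}\t{n}")
```

One way to feed it would be to save `lsof -nP -iTCP -a -p <pid>` for the web process to a file and pass that file as the argument. A large count against 7788 would point at Baggageclaim connections, as in this incident.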
- Concourse version: 3.8.0-rc.2
- Deployment type: BOSH
- Infrastructure/IaaS: GCP
- Did this used to work? 🤷♂️