Closed
Feature Request
What challenge are you facing?
Currently the ATC is responsible for destroying containers/volumes across workers. This is network-intensive, error-prone, and difficult to parallelize while keeping resource consumption reasonable (a swarm of connections can easily lead to a 'too many open files' error). This happened to our large-scale Concourse instance, Wings, resulting in the whole server being dead and volumes leaking forever.
A Modest Proposal
Here's one idea:
- Don't have the ATC talk to workers to destroy containers/volumes - have its GC only mark them as 'destroying'.
- Add an API endpoint, `/api/v1/workers/<name>/sync` (or something). Details described after.
- Add a TSA command, `sync` (or whatever we call this), which does a `POST` to the above endpoint with the worker's list of container/volume handles.
- Any container/volume handles not included in the submitted list will be removed from the ATC's database.
- The API endpoint then returns the list of container/volume handles in `DESTROYING` state. This is passed on to the caller of the `sync` command.
- Add a process on the worker that periodically invokes this `sync` command (using the worker's private key as authorization), and destroys the returned containers/volumes.
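
To make the endpoint's behaviour concrete, here is a rough Python sketch of what the `sync` handler might do on the ATC side. It is only an illustration, not Concourse code: `WorkerRecords`, `sync`, the state strings, and the in-memory dictionaries standing in for the database are all made-up names. Handles the worker reports but the ATC has never heard of are simply ignored here, since the proposal doesn't say what to do with them.

```python
from dataclasses import dataclass, field

# Hypothetical state names; the proposal only names the DESTROYING state.
CREATED = "created"
DESTROYING = "destroying"


@dataclass
class WorkerRecords:
    """Hypothetical stand-in for the ATC's database rows for one worker."""
    containers: dict = field(default_factory=dict)  # handle -> state
    volumes: dict = field(default_factory=dict)     # handle -> state


def _reconcile(known, reported):
    """Drop rows the worker no longer has; return handles marked DESTROYING."""
    # Any handle missing from the worker's report is removed from the "database".
    for handle in list(known):
        if handle not in reported:
            del known[handle]
    # Whatever remains in DESTROYING state is what the worker should destroy.
    return sorted(h for h, state in known.items() if state == DESTROYING)


def sync(records, reported_containers, reported_volumes):
    """Handle one sync POST: prune unknown rows, return handles to destroy."""
    return {
        "containers": _reconcile(records.containers, set(reported_containers)),
        "volumes": _reconcile(records.volumes, set(reported_volumes)),
    }
```

For example, if the ATC has `c2` and `c3` marked `DESTROYING` but the worker only reports `c1` and `c2`, the row for `c3` is dropped and the response lists just `c2`.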
This has quite a few benefits:
- Far fewer moving parts in the GC - it's all just database work now, making it much less prone to locking up.
- Easier to reason about worker parallelism, now that it's each worker doing its own slice. We'd just need a max-in-flight, rather than a fancy per-worker job queue.
- More effectively distributes work across the cluster; the ATC is no longer a bottleneck for removing all containers/volumes.
- This also fixes the unrecoverable cases where volumes/containers are removed out-of-band from the workers, leading to `unknown handle` errors - the initial `POST` will clear them out. Ref. Lots of `unknown handle` errors #1255, unknown handle ... after cleaning of the workers volumes #1305, Unknown handle makes pipeline unusable #1322, failed to find created volume in baggageclaim #1550, "unknown handle" errors after upgrading Concourse 2.6.0 to 3.3.4 #1721, unknown handle on repository. #1821. As a result this should help out with Investigation: non-BOSH worker operation lifecycle #1457.
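
The worker-side half, including the max-in-flight limit mentioned above, could look roughly like the following Python sketch. Again this is an assumption-laden illustration: `sync_once` and its three callables (`list_handles`, `call_sync`, `destroy`) are hypothetical, containers and volumes are lumped into a single list of handles for brevity, and a bounded thread pool is just one way to cap concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


def sync_once(list_handles, call_sync, destroy, max_in_flight=5):
    """Run one sync round and return (destroyed, failed) handle lists.

    All three callables are hypothetical hooks supplied by the caller:
      list_handles() -> handles currently present on this worker
      call_sync(handles) -> handles the ATC has marked as destroying
      destroy(handle) -> removes one container/volume, raising on failure
    """
    to_destroy = call_sync(list_handles())

    # The pool size is the max-in-flight limit: at most that many
    # destroy calls run at the same time on this worker.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = {handle: pool.submit(destroy, handle) for handle in to_destroy}
    # Leaving the with-block waits for every submitted destroy to finish.

    destroyed, failed = [], []
    for handle, future in futures.items():
        if future.exception() is None:
            destroyed.append(handle)
        else:
            failed.append(handle)
    return destroyed, failed
```

A periodic timer on the worker would call `sync_once` repeatedly. Nothing special is needed for failures in this sketch: a handle that failed to be destroyed is still reported in the next round and still in `DESTROYING` state, so it is returned again, while a destroyed handle drops out of the report and its row gets cleared.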