Distribute container/volume garbage-collection across workers #1959

@vito

Feature Request

What challenge are you facing?

Currently the ATC is responsible for destroying containers/volumes across all workers. This is network-intensive, error-prone, and difficult to parallelize while keeping resource consumption reasonable (a swarm of connections can easily lead to a 'too many open files' error). This happened on our large-scale Concourse instance, Wings, and left the whole server dead and volumes leaking forever.

A Modest Proposal

Here's one idea:

  1. Don't have the ATC talk to workers to destroy containers/volumes - have its GC only mark them as 'destroying'.
  2. Add an API endpoint, /api/v1/workers/<name>/sync (or something). Details are in the steps that follow.
  3. Add a TSA command, sync (or whatever we call this), which does a POST to the above endpoint with the worker's list of container/volume handles.
  4. Any container/volume handles not included in the submitted list will be removed from the ATC's database.
  5. The API endpoint then returns the list of container/volume handles in the 'destroying' state. This is passed on to the caller of the sync command.
  6. Add a process on the worker that periodically invokes this sync command (using the worker's private key as authorization), and destroys the returned containers/volumes.
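
A rough sketch of the reconciliation in steps 4 and 5, to make the semantics concrete. Everything here is hypothetical: `Store`, `State`, and `Sync` are illustration-only names, and an in-memory map stands in for the ATC's database. It follows the steps literally, so any stored handle the worker did not report is dropped, and handles the worker reports that the ATC has never heard of are simply ignored (the proposal doesn't say what to do with those).

```go
package main

import (
	"fmt"
	"sort"
)

// State is a hypothetical lifecycle state for a container/volume row.
type State string

const (
	Created    State = "created"
	Destroying State = "destroying"
)

// Store is an in-memory stand-in for the ATC's database:
// worker name -> handle -> state. Not Concourse's real schema.
type Store map[string]map[string]State

// Sync reconciles the handles reported by a worker against the store.
// Stored handles missing from the report are removed (step 4); the
// remaining handles in the destroying state are returned (step 5).
func (s Store) Sync(worker string, reported []string) []string {
	present := make(map[string]bool, len(reported))
	for _, h := range reported {
		present[h] = true
	}

	toDestroy := []string{}
	for handle, state := range s[worker] {
		if !present[handle] {
			// The worker no longer has it, so forget it.
			delete(s[worker], handle)
			continue
		}
		if state == Destroying {
			toDestroy = append(toDestroy, handle)
		}
	}

	// Sort only to make the result deterministic for callers.
	sort.Strings(toDestroy)
	return toDestroy
}

func main() {
	store := Store{"worker-1": {
		"a": Created,
		"b": Destroying,
		"c": Destroying, // already gone from the worker
		"d": Created,    // removed out-of-band on the worker
	}}

	fmt.Println(store.Sync("worker-1", []string{"a", "b", "x"}))
	fmt.Println(len(store["worker-1"]))
}
```

With the sample data, only `b` comes back for destruction, while `c` and `d` are cleared from the store because the worker no longer has them.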

This has quite a few benefits:

  1. Far fewer moving parts in the GC - it's all just database work now, making it much less prone to locking up.
  2. Easier to reason about worker parallelism, now that it's each worker doing its own slice. We'd just need a max-in-flight, rather than a fancy per-worker job queue.
  3. More effectively distributes work across the cluster; the ATC is no longer a bottleneck for removing all containers/volumes.
  4. This also fixes the unrecoverable cases where volumes/containers are removed out-of-band from the workers, leading to unknown handle errors - the initial POST will clear them out. Ref. #1255 (Lots of unknown handle errors), #1305 (unknown handle ... after cleaning of the workers volumes), #1322 (Unknown handle makes pipeline unusable), #1550 (failed to find created volume in baggageclaim), #1721 ("unknown handle" errors after upgrading Concourse 2.6.0 to 3.3.4), #1821 (unknown handle on repository.). As a result this should help out with #1457 (Investigation: non-BOSH worker operation lifecycle).
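
To illustrate the max-in-flight idea from benefit 2, here's a minimal sketch of how a worker might destroy the handles it got back from sync, using a buffered channel as a semaphore. `destroyAll`, `errDestroyFailed`, and the `destroy` callback are made-up names for illustration; the real Garden/Baggageclaim destroy calls are not shown.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// errDestroyFailed is a placeholder error for the demo below.
var errDestroyFailed = errors.New("destroy failed")

// destroyAll calls destroy once per handle, with at most maxInFlight
// calls running at the same time. It returns the handles whose destroy
// call failed, sorted, so a later sync can retry them.
func destroyAll(handles []string, maxInFlight int, destroy func(string) error) []string {
	if maxInFlight < 1 {
		maxInFlight = 1
	}

	sem := make(chan struct{}, maxInFlight) // one slot per in-flight destroy
	var wg sync.WaitGroup
	var mu sync.Mutex
	failed := []string{}

	for _, h := range handles {
		sem <- struct{}{} // blocks while maxInFlight destroys are running
		wg.Add(1)
		go func(handle string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when done

			if err := destroy(handle); err != nil {
				mu.Lock()
				failed = append(failed, handle)
				mu.Unlock()
			}
		}(h)
	}

	wg.Wait()
	sort.Strings(failed)
	return failed
}

func main() {
	// Hypothetical destroy function: pretend one volume cannot be removed.
	destroy := func(handle string) error {
		if handle == "vol-2" {
			return errDestroyFailed
		}
		return nil
	}

	failed := destroyAll([]string{"ctr-1", "vol-1", "vol-2"}, 2, destroy)
	fmt.Println(failed)
}
```

Since each worker only handles its own slice, a single limit like this is all the concurrency control it needs. Failed handles stay in the 'destroying' state on the ATC side and would just come back on the next sync.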
