Distribute container/volume garbage-collection across workers

# Feature Request

## What challenge are you facing?

Currently the ATC is responsible for destroying containers/volumes across workers. This is network-intensive, error-prone, and difficult to parallelize while keeping resource consumption reasonable (a swarm of connections can easily lead to a 'too many open files' error). This happened to our large-scale Concourse instance, Wings, taking down the whole server and leaking volumes forever.

## A Modest Proposal

Here's one idea:

1. Don't have the ATC talk to workers to destroy containers/volumes - have its GC only mark them as 'destroying'.
1. Add an API endpoint, `/api/v1/workers/<name>/sync` (or something). Its behavior is described in the steps that follow.
1. Add a TSA command, `sync` (or whatever we call this), which does a `POST` to the above endpoint with the worker's list of container/volume handles.
1. Any container/volume handles *not* included in the submitted list will be removed from the ATC's database.
1. The API endpoint then returns the list of container/volume handles in the `DESTROYING` state. This is passed on to the caller of the `sync` command.
1. Add a process on the worker that periodically invokes this `sync` command (using the worker's private key as authorization), and destroys the returned containers/volumes.
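
The reconciliation the endpoint would perform can be sketched roughly as follows. This is only an illustration of steps 4 and 5, not actual ATC code: the `State` type, the `Sync` function, and the in-memory map standing in for the database are all hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a hypothetical lifecycle state for a container/volume row.
type State string

const (
	Created    State = "created"
	Destroying State = "destroying"
)

// Sync reconciles the handles reported by one worker against the rows the
// database holds for that worker. db (handle -> state) stands in for the
// real tables and is mutated in place.
//
// Handles the worker reports but the database does not know about are
// ignored here; the proposal does not say how to treat them.
func Sync(db map[string]State, reported []string) []string {
	present := make(map[string]bool, len(reported))
	for _, h := range reported {
		present[h] = true
	}

	var toDestroy []string
	for h, st := range db {
		if !present[h] {
			// The worker no longer has it, so drop the row (step 4).
			delete(db, h)
			continue
		}
		if st == Destroying {
			// Still on the worker and marked for removal (step 5).
			toDestroy = append(toDestroy, h)
		}
	}
	sort.Strings(toDestroy) // deterministic order for callers
	return toDestroy
}

func main() {
	db := map[string]State{"a": Created, "b": Destroying, "c": Destroying}
	// The worker reports "a" and "b"; "c" was removed out-of-band.
	fmt.Println(Sync(db, []string{"a", "b"}), len(db))
}
```

In this sketch a handle that is marked for destruction but already missing from the worker is simply dropped from the database rather than returned, since there is nothing left to destroy.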

This has quite a few benefits:

1. Far fewer moving parts in the GC - it's all just database work now, making it much less prone to locking up.
1. Easier to reason about worker parallelism, now that it's each worker doing its own slice. We'd just need a max-in-flight, rather than a fancy per-worker job queue.
1. More effectively distributes work across the cluster; the ATC is no longer a bottleneck for removing all containers/volumes.
1. This also fixes the unrecoverable cases where volumes/containers are removed out-of-band from the workers, leading to `unknown handle` errors - the initial `POST` will clear them out. Ref. #1255, #1305, #1322, #1550, #1721, #1821. As a result this should help out with #1457.

Distribute container/volume garbage-collection across workers #1959
