Description
Feature Request
What challenge are you facing?
With the changes in #1959, the ATC is now responsible both for marking orphaned containers/volumes for GC and for garbage-collecting failed containers/volumes itself. During the marking phase, it scans the whole DB to build two lists: entries that are created but orphaned, and entries that are already marked as destroying. It then iterates through the created list and updates each entry to destroying. This is inefficient for a large-scale Concourse deployment with multiple ATCs and hundreds of pipelines, because every ATC tries to do the same work and they lock each other out while running the for loop that updates the DB. This can slow the ATC down heavily, especially when it is recovering from a deployment where a huge number of containers/volumes are waiting to be cleaned up. When that happened on Wings, we observed errors such as resources not ticking with new versions and max containers reached. In fact, GC on Wings is much slower with multiple ATC nodes than with a single ATC node.
A Modest Proposal
- Move the marking process out of the ATC's periodic GC and scope it to a worker, i.e. it only finds and marks entries that belong to the given worker.
- Trigger this marking process only when a worker requests the list of entries to be destroyed (workers still make that request periodically).
- When the ATC finds and marks entries to be destroyed, do a batch update on the DB instead of a for loop.
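To illustrate the batch update in the last point, here is a minimal sketch in Go. It only builds the statement and its arguments, and it is not taken from the ATC code: the table name `containers`, the columns `worker_name`, `state` and `id`, and the helper `buildBatchMark` are all hypothetical, and PostgreSQL-style placeholders are assumed.

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchMark returns a single UPDATE statement that marks all given
// entries as destroying in one round trip, restricted to one worker.
// Table and column names are hypothetical, chosen for illustration only.
func buildBatchMark(worker string, ids []int) (string, []interface{}) {
	if len(ids) == 0 {
		return "", nil // nothing to mark, so no statement is needed
	}
	args := []interface{}{worker} // $1 is always the worker name
	placeholders := make([]string, len(ids))
	for i, id := range ids {
		placeholders[i] = fmt.Sprintf("$%d", i+2) // IDs start at $2
		args = append(args, id)
	}
	query := "UPDATE containers SET state = 'destroying' " +
		"WHERE worker_name = $1 AND state = 'created' AND id IN (" +
		strings.Join(placeholders, ", ") + ")"
	return query, args
}

func main() {
	query, args := buildBatchMark("worker-1", []int{7, 8, 9})
	fmt.Println(query)
	fmt.Println(args...)
}
```

The per-entry loop issues one UPDATE per orphaned entry, while this form issues one statement per worker request, so row locks should be held for a shorter window.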
Potential benefits:
- Make GC scale with the number of ATC nodes. With more ATC nodes, the worker requests that list destroying entries will be distributed across the ATC pool.
- Much less lock contention across ATC nodes, since each marking process will only scan and mark entries for the given worker name.
- Batch updates will also save ATC CPU cycles.
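The request-triggered, worker-scoped flow proposed above could look roughly like the following in-memory sketch in Go. The `store` and `entry` types and the `destroyList` method are hypothetical stand-ins for the DB and the ATC endpoint. The point is only that a request from one worker never touches another worker's entries.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// entry is a hypothetical stand-in for a container or volume row.
type entry struct {
	worker   string
	state    string // "created" or "destroying"
	orphaned bool
}

// store is a hypothetical in-memory stand-in for the DB.
type store struct {
	mu      sync.Mutex
	entries map[string]*entry // keyed by handle
}

// destroyList runs when a worker asks which entries it should destroy.
// It marks that worker's orphaned entries as destroying, then returns the
// handles of all of that worker's destroying entries, sorted by handle.
func (s *store) destroyList(worker string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var handles []string
	for handle, e := range s.entries {
		if e.worker != worker {
			continue // entries of other workers are never scanned further
		}
		if e.state == "created" && e.orphaned {
			e.state = "destroying"
		}
		if e.state == "destroying" {
			handles = append(handles, handle)
		}
	}
	sort.Strings(handles)
	return handles
}

func main() {
	s := &store{entries: map[string]*entry{
		"c1": {worker: "worker-1", state: "created", orphaned: true},
		"c2": {worker: "worker-1", state: "created", orphaned: false},
		"c3": {worker: "worker-2", state: "created", orphaned: true},
	}}
	fmt.Println(s.destroyList("worker-1"))
}
```

Because each request is scoped to one worker name, requests from different workers that land on different ATC nodes work on disjoint sets of entries.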