All workers busy when using container placement strategy limit-active-tasks in 7.0.0 #6648

@mkocus

Description

Summary

We are running a Concourse environment with 1 web node and 3 workers. The web node's docker-compose config limits the maximum number of active tasks per worker to 1. One of the workers uses a tag to restrict which tasks can run on it.

      CONCOURSE_CONTAINER_PLACEMENT_STRATEGY: limit-active-tasks
      CONCOURSE_MAX_ACTIVE_TASKS_PER_WORKER: 1

We are experiencing an issue that did not occur before updating to version 7.0.0: workers randomly become busy with unknown tasks (nothing is visible in the web UI), and no other job can be started:

All workers are busy at the moment, please stand-by.

This can happen on any worker. In particular, when it happens on the tagged worker, all jobs that require that tag starve with the message shown above.
We have checked that the required workers are indeed online (state: running).

fly -t ci workers
name                 containers  platform  tags  team  state    version  age
26a773549c67         46          linux     none  none  running  2.3      9d
Jans-Mac-mini.local  0           darwin    jan   none  running  2.3      7d
f0e17b5530aa         39          linux     none  none  running  2.3      10d

We have also checked with fly -t ci builds that there are no other blocking jobs. Only a single build, which is not a resource check, is shown:

1379786  some-project/some-step/114  started    2021-03-08@17:24:57+0100  n/a                       7m27s+    main  user
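For anyone who wants to repeat this check, here is a minimal sketch of how the build list can be narrowed down to builds that are still in progress. It assumes the status is the third whitespace-separated column of the fly builds output, as in the line above; started_builds is a hypothetical helper name, not part of fly.

```shell
# Hypothetical helper: keep only the rows whose third column (status) is "started".
# Assumes the column layout shown in the fly builds output above.
started_builds() {
  awk '$3 == "started"'
}

# Usage (needs a logged-in fly target named "ci", as used in this issue):
#   fly -t ci builds | started_builds
```

If this prints only builds that are unrelated to the busy worker, the task count held against that worker is probably not coming from a visible build.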

Steps to reproduce

Run 1 web node and 1 worker node with the limit-active-tasks strategy and a maximum of 1 active task per worker. After some time the worker becomes locked, and nothing can be run on it. Restarting the web node and/or the worker node does not help.

Expected results

When there is no job running on the given worker, it should be possible to run a job on it, using limit-active-tasks and 1 task per worker.

Actual results

The web node reports "All workers are busy at the moment, please stand-by." and no job can be run on the given worker.

Additional context

We are using garden, because containerd caused issues described in #6613.

Triaging info

  • Concourse version: 7.0.0
  • Browser (if applicable): N/A
  • Did this used to work? Yes, has been working on all previous 6.x.x versions
