Closed
Summary
Two (or more) tasks can land on the same worker even though the limit-active-tasks container placement strategy is selected and max-active-tasks-per-worker is set to 1.
This bug was probably introduced in this refactoring commit:
11a0216
In my understanding, two conditions must both hold for the bug to occur (see the chooseTaskWorker method in atc/worker/client.go):
- The call chosenWorker = client.pool.FindOrChooseWorkerForContainer for the second task happens before the first task increases the number of active tasks on the same worker. In this case, chosenWorker is not nil for the second task.
- The second task tries to acquire the lock only after the first task has already released it. Otherwise, the second task would sleep for 1 second, the first task would by then have increased the number of active tasks, and the bug would never occur.
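The two conditions above amount to a check-then-act race: the capacity check and the counter increment are separate steps, so a second task can pass the check before the first task's increment becomes visible. The following is a minimal, self-contained sketch of that pattern, not the actual Concourse code. The worker type and the hasCapacity, increment, and tryReserve names are hypothetical and exist only for illustration. The sketch replays the problematic ordering deterministically, then contrasts it with a variant that checks and increments in a single critical section, which is one possible shape of a fix and not necessarily the one Concourse adopted.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a hypothetical stand-in for a Concourse worker. It only tracks
// how many tasks are currently counted against it.
type worker struct {
	mu     sync.Mutex
	active int
	max    int
}

// hasCapacity is the "choose" step: it only reads the counter.
func (w *worker) hasCapacity() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.active < w.max
}

// increment is the separate "act" step that records a task on the worker.
func (w *worker) increment() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.active++
}

// tryReserve checks and increments in one critical section, so two tasks
// can never both observe spare capacity on a full worker.
func (w *worker) tryReserve() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.active >= w.max {
		return false
	}
	w.active++
	return true
}

// racyInterleaving replays the problematic ordering deterministically:
// both tasks choose the worker before either one increments the counter.
func racyInterleaving(max int) int {
	w := &worker{max: max}
	firstChosen := w.hasCapacity()
	secondChosen := w.hasCapacity() // still sees active == 0
	if firstChosen {
		w.increment()
	}
	if secondChosen {
		w.increment()
	}
	return w.active
}

// atomicReservation lets many tasks compete concurrently through tryReserve
// and returns how many were admitted.
func atomicReservation(max, tasks int) int {
	w := &worker{max: max}
	var wg sync.WaitGroup
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w.tryReserve()
		}()
	}
	wg.Wait()
	return w.active
}

func main() {
	fmt.Println("check-then-act:", racyInterleaving(1))      // prints 2
	fmt.Println("atomic reserve:", atomicReservation(1, 60)) // prints 1
}
```

In the first case both tasks are admitted even though the limit is 1, which matches the behaviour described in this report. In the second case only one of the 60 competing tasks is admitted, because no task can pass the check without also taking the slot.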
The bug occurs very rarely. For example, in the last three months we experienced it only about 15 times in our infrastructure.
Steps to reproduce
Add:
CONCOURSE_CONTAINER_PLACEMENT_STRATEGY: limit-active-tasks
CONCOURSE_MAX_ACTIVE_TASKS_PER_WORKER: 1
to your docker-compose.yml
and run:
docker-compose \
  -f ./docker-compose.yml \
  -f ./hack/overrides/prometheus.yml \
  up -d
Create a small pipeline template (run_job.yml), e.g.:
---
jobs:
- name: loop
  plan:
  - task: loop
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: busybox
      run:
        path: sh
        args:
        - -c
        - |
          echo "Executing on worker: `hostname`"
          for i in `seq 1 30`; do sleep 1; echo "Slept for ${i} seconds."; done
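Then execute the following shell script:
#!/bin/bash -ex
# Create and unpause 60 identical pipelines.
for i in `seq 1 60`; do fly -t ci sp -p parallel${i}-second-linux -c run_job.yml -n && fly -t ci up -p parallel${i}-second-linux; done
# Trigger the loop job of the pipeline whose index is passed as $1.
task(){
  fly -t ci tj -j "parallel${1}-second-linux/loop"
}
# Trigger all 60 jobs concurrently, then wait for the fly calls to return.
for i in `seq 1 60`; do
  task "$i" &
done
wait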
Expected results
At any time, at most one task runs on a given worker when max-active-tasks-per-worker is set to 1.
Actual results
Two (or more) tasks can land on the same worker at the same time.
Triaging info
- Concourse version: 6.5.1
- Browser (if applicable):
- Did this used to work? Yes