Wait for worker matching strategy when scheduling build steps #6635
Conversation
Hi @multimac, thanks for the PR! Seems very useful. I do have some concerns about removing the locking around `limit-active-tasks`, though - it would introduce a regression to #6206. While it's true that it's more consistent with `limit-active-{containers,volumes}`, I think it's likely more impactful since, depending on the workload, having a couple more containers around is much less impactful than having a couple more very expensive tasks. If needed, I think we'd be better off just adding locking to `limit-active-{containers,volumes}` rather than removing it from `limit-active-tasks`.

One alternative to the approach taken here that would allow for specialized behaviour for task steps (while still allowing other steps to wait for a worker) is to add a method like:

```go
func (d *taskDelegate) WaitForWorker(...) (worker.Client, error) {
	if !strategy.ModifiesActiveTasks() {
		return d.BuildStepDelegate.WaitForWorker(...)
	}
	...
}
```

What do you think about this approach? More than happy to chat about alternative approaches.
Force-pushed from d5dbc56 to 302fe0d
Hey @aoldershaw, thanks for the feedback. That's fair. What do you think of this commit, multimac@302fe0d? I was thinking about it a bit more overnight and thought we could make

I'm not sure how container and volume counts are tracked, but theoretically we could increment those counts in
@multimac I'm a bit worried about acquiring a global lock for every step - on large deployments, this would result in a lot of lock contention. We can possibly remove the need for the lock with something like:

```sql
UPDATE workers
SET active_tasks = active_tasks + 1
WHERE name IN (...candidates...)
  AND active_tasks < ...max_active_tasks...
```

and then after a worker is selected:

```sql
UPDATE workers
SET active_tasks = active_tasks - 1
WHERE name IN (...candidates...)
  AND name <> ...selected worker...
```

Re: container and volume counts, that's a bit different - they aren't incremented/decremented in the same way.

Oh, and side note - I realize this PR changes the behaviour of

Now, it's only doing 2, which I think makes more sense honestly. However, maybe we should add in a
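Editor's note: the two-phase UPDATE idea above can be sketched in plain Go. The `workerTable` type and method names below are hypothetical in-memory stand-ins for the `workers` table, not Concourse code; this only illustrates the claim-then-release flow.

```go
package main

import "fmt"

// workerTable is a hypothetical in-memory stand-in for the workers table.
type workerTable struct {
	activeTasks map[string]int
}

// claimCandidates mirrors the first UPDATE: bump active_tasks for every
// candidate still under the limit, returning the workers that were claimed.
func (t *workerTable) claimCandidates(candidates []string, max int) []string {
	var claimed []string
	for _, name := range candidates {
		if t.activeTasks[name] < max {
			t.activeTasks[name]++
			claimed = append(claimed, name)
		}
	}
	return claimed
}

// releaseOthers mirrors the second UPDATE: decrement active_tasks for
// every claimed worker except the one that was actually selected.
func (t *workerTable) releaseOthers(claimed []string, selected string) {
	for _, name := range claimed {
		if name != selected {
			t.activeTasks[name]--
		}
	}
}

func main() {
	table := &workerTable{activeTasks: map[string]int{"w1": 2, "w2": 0, "w3": 1}}
	claimed := table.claimCandidates([]string{"w1", "w2", "w3"}, 2) // w1 is already at the limit
	fmt.Println(claimed)                                           // [w2 w3]
	table.releaseOthers(claimed, "w2")
	fmt.Println(table.activeTasks["w2"], table.activeTasks["w3"]) // 1 1
}
```

Because the claim happens before selection, a worker at the limit can never be handed another task, without any global lock being held.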
Yeah, that does sound a bit gross now you mention it. Instead of bumping the active task count for all containers, could we use

I've pushed up a new commit, 5f24ab2, which tries that out. I had to add a

Regarding the sorting of active containers, I think I removed this with the intent to add it back in but must have forgotten. I've put it back in the commit mentioned above, though. In that commit I also removed the logic from the

I imagine there could be a scenario currently where, if you configured a strategy chain like
I think your approach with `Order` and `Pick` makes things a lot nicer! I've left some suggestions
Awesome! Thanks for all the feedback. I don't know why my Google-fu didn't stumble over

One quick note: I noticed this comment (by you, coincidentally)...

Lines 114 to 118 in 5f24ab2

I think I could restructure the code a bit to check for a valid team worker first and then, if all team-specific workers are rejected by the strategy, fall back to checking the general worker pool. Do you have any thoughts on this? Would it maybe be out of scope for this PR? I can imagine a scenario where a team is assigned a single worker and, when that worker fails the configured strategy, their jobs start failing (or stalling once this PR is merged).
I think it'd be best to leave that alone for now, at least for this PR - I don't totally follow the reasoning for the current behaviour, but I know it was done deliberately.
Can use this pipeline to test that the steps wait for a worker to be returned by the configured strategy, instead of erroring out like before. Tested using the following environment variables to set a limit on max active tasks...
The "limit-active-tasks" placement strategy requires some dedicated code (specifically, the "ModifiesActiveTasks" method) and special handling to ensure that there is a hard limit on active tasks in Concourse. Because of this, I'm temporarily removing this placement strategy to simplify the refactoring of container placement. I will add this functionality back in a later commit, though it may need to be altered slightly. Signed-off-by: David Symons <david@symons.io>
This commit takes the logic in "task_delegate.go" around waiting for the configured placement strategy to find a worker to schedule work on, and moves it to "pool.go" so that it can be reused across all the different kinds of steps. The new method, "WaitForWorker", will execute the configured placement strategy periodically (currently every 5 seconds) until a worker is found. Signed-off-by: David Symons <david@symons.io>
This adds a new metric for tracking the number of steps currently waiting in "WaitForWorker" for the configured placement strategy to find a worker to schedule containers on. This should be a decent metric to determine when new worker nodes need to be created to handle pending jobs. Signed-off-by: David Symons <david@symons.io>
The tracking of active tasks was removed in 2446d09 to allow for the refactoring of the container placement logic. This commit adds that logic back into "task_step.go". Signed-off-by: David Symons <david@symons.io>
The "limit-active-tasks" placement strategy was temporarily removed during refactoring, and this commit adds it back. NOTE: The "limit-active-tasks" strategy is a bit less precise now, similar to "limit-active-containers" and "limit-active-volumes", as there is a small period of time between checking the active task count for a worker and picking / incrementing the active task count for the worker (if chosen). Should multiple tasks attempt to be scheduled at the same time, they may all pick the same worker and push the active task count for that worker above the configured limit. Signed-off-by: David Symons <david@symons.io>
This adds a new event, "waiting-for-worker", to notify the user when there is a delay in finding an appropriate worker to schedule the step on. Signed-off-by: David Symons <david@symons.io>
This adds a lock back to ensure that the "limit-active-tasks" strategy will not allow the number of active tasks to go beyond the configured limit. Signed-off-by: David Symons <david@symons.io>
…ctive-tasks" This removes the lock in worker.Pool. Instead of using a lock to ensure accurate tracking of active tasks, the "limit-active-tasks" strategy optimistically increases the active task count in "Pick" and, if that bumps the number of active tasks over the configured limit, returns an error there (while also decrementing the task count back). Signed-off-by: David Symons <david@symons.io>
Updates the placement and pool tests to test the new waiting behaviour. Signed-off-by: David Symons <david@symons.io>
Signed-off-by: David Symons <david@symons.io>
Force-pushed from c01751c to 973c19b
I really like what you've done, but have a few small comments! Thanks for your patience!
Signed-off-by: David Symons <david@symons.io>
Thanks again (again)! Hopefully the review process wasn't too painful, but I really appreciate all the work you put into addressing my concerns in a timely manner. Let's get this merged and see if we can weed out any other issues
FYI @multimac the latest
No worries. Thanks for your help reviewing the changes. And the review process wasn't painful at all; it was actually a very good experience. Hopefully I'll be able to do up some more PRs for Concourse in the future. And awesome! I didn't get a chance to throw them in our cluster at work today, but will hopefully get the chance to tomorrow. Really keen to try out this functionality and hopefully be able to downscale our pool of workers a little bit.
Awesome! By the way, no pressure, but if you are interested in submitting more PRs in the future, you may want to add yourself to the contributors team by submitting a PR to concourse/governance (for example, this one). It doesn't change too much, but grants you access to the PRs pipeline (so you can rerun flaky builds, etc.), and access to the
Signed-off-by: Bohan Chen <bochen@pivotal.io> Co-authored-by: Esteban Foronda <eforonda@vmware.com>
the implementation is pretty much identical to #6635 with a couple exceptions:

* Order for LimitActive{Containers,Volumes} now partitions the workers into two groups - those that satisfy the limit and those that don't. Those that don't are given lower priority, which should result in doing less work when Pick'ing a worker (since the workers that don't exceed the limit are attempted first)
* I got rid of Name() and NoWorkerFitContainerPlacementStrategyError because they were unused - since Order doesn't filter out workers, that error would never be returned

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
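Editor's note: a stable partition like the one described for `Order` can be achieved with `sort.SliceStable`. The `worker` struct and `orderByLimit` function here are simplified stand-ins for illustration, not the real implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// worker is a hypothetical stand-in for a candidate worker.
type worker struct {
	name             string
	activeContainers int
}

// orderByLimit keeps workers under the limit first (preserving their
// existing order within each group) and pushes workers at or over the
// limit to the back, so Pick tries them last instead of filtering them out.
func orderByLimit(workers []worker, limit int) []worker {
	ordered := append([]worker(nil), workers...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].activeContainers < limit && ordered[j].activeContainers >= limit
	})
	return ordered
}

func main() {
	workers := []worker{{"w1", 30}, {"w2", 5}, {"w3", 40}, {"w4", 10}}
	for _, w := range orderByLimit(workers, 20) {
		fmt.Print(w.name, " ")
	}
	// w2 w4 w1 w3
}
```

Using a stable sort matters here: it preserves whatever ordering an earlier strategy in the chain produced within each partition.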
What does this PR accomplish?
Bug Fix | Feature | Documentation
Fixes #6648
Changes proposed by this PR:
Similar to how the `task` step would wait for a worker to become available when using the `limit-active-tasks` strategy, this expands that functionality to all strategies (like the newly introduced `limit-active-containers` and `limit-active-volumes`), as well as applying it to all build step types (i.e. `check`/`get`/`put`/`task`). Additionally, this PR adds a `waiting-for-worker` event to notify the user when a build is stuck waiting for a worker to become available.

Notes to reviewer:
The existing `limit-active-tasks` strategy was a bit tricky to keep around in its current form, so I had to rewrite it, and that has altered its behavior slightly. Before, it would lock and ensure that the number of active tasks never went above the configured limit. After the rewrite, `limit-active-tasks` now behaves more like `limit-active-containers`/`limit-active-volumes` in that it tries to cap the number of active tasks at the configured limit but may go over slightly due to race conditions.

Also, I still need to add some additional tests around the `WaitForWorker` method, but wanted to open this PR to get some initial feedback on the implementation and also the tweaks to `limit-active-tasks` mentioned above.

Release Note
* Except for `task` steps when using the `limit-active-tasks` placement strategy, a step would previously simply error the build when no worker was available; steps now wait for a worker
* The `concourse_tasks_waiting` metric was removed and replaced with `concourse_steps_waiting{type="task"}`