Description
Hi there!
We are having trouble scaling one job that uses the concourse-kubernetes-resource. This is a third-party resource, so I understand it may be out of scope for the Concourse team, but if you have seen similar issues in the past and can provide any insights, that would be very valuable for debugging. We have made some minor changes to the resource, but it mainly runs the kubectl set image command.
Scenario:
Our use case requires running 25+ tasks in aggregate. A successful run takes anywhere from <2m to >30m. When a run does not succeed, we see the following inconsistent behavior:
- The tasks get stuck midway through execution and don't recover. They end up with an interrupted message, which we believe generally translates to timeouts.
- It is unclear at what point the timeout window starts, since each task has 5 attempts, and often the retries start and end right away.
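To make the timeout question concrete, here is a rough upper bound on job duration. It is only a sketch under two assumptions that this report does not confirm: that the timeout window restarts on every attempt, and that all steps inside one aggregate block run fully in parallel. The function name and the model itself are illustrative.

```python
def worst_case_minutes(batch_sizes, timeout_min=3, attempts=5):
    """Upper bound on job duration if every put exhausts all of its attempts.

    Assumption: each attempt gets a fresh timeout window, and the puts inside
    one aggregate block run concurrently, so a block costs one put's worst case.
    """
    per_put = timeout_min * attempts
    # Aggregate blocks run one after another, so their worst cases add up.
    return sum(per_put for size in batch_sizes if size > 0)


print(worst_case_minutes([3, 2]))    # the two aggregate blocks from the example configuration
print(worst_case_minutes([1] * 25))  # 25 puts executed strictly in serial
```

Under these assumptions the example configuration is bounded at 30 minutes and a fully serial run of 25 puts at 375 minutes. If the timeout instead covers all attempts combined, the bounds would be much lower.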
Debugging:
We have tried the following things so far:
- Experimented with various timeouts: 1m/2m/3m.
- Tried executing all tasks in serial, but it takes a very long time, 30m or more.
- Tried executing tasks in batches of 3/5/7, but we get the same inconsistent results.
- Restarting the workers helps clear things up, but within 2 or 3 runs the job is back to its original behavior.
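For anyone who wants to reproduce the batching experiments, the plan can be generated rather than edited by hand. The helper below is a hypothetical sketch, not part of Concourse or of the resource: it splits a list of put steps into consecutive aggregate blocks of a given size and prints the result as JSON, which is also valid YAML. The resource names and params mirror the placeholders in the example configuration.

```python
import json


def batched_plan(resource_names, batch_size, timeout="3m", attempts=5):
    """Split put steps into consecutive aggregate blocks of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    plan = []
    for start in range(0, len(resource_names), batch_size):
        batch = resource_names[start:start + batch_size]
        plan.append({
            "aggregate": [
                {
                    "put": name,
                    "timeout": timeout,
                    "attempts": attempts,
                    "params": {
                        # Placeholder values copied from the example configuration.
                        "image_name": "our-image/repository",
                        "image_tag": "version/number",
                    },
                }
                for name in batch
            ]
        })
    return plan


names = [f"resource{i}-staging" for i in range(1, 6)]
print(json.dumps(batched_plan(names, batch_size=3), indent=2))
```

With five resources and a batch size of 3 this yields two aggregate blocks (three puts, then two), the same shape as the example configuration.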
Can you suggest a way we can debug this issue? Your help is greatly appreciated. Thank you!
Dharmesh
Example configuration:
- name: jobs-staging
  serial: true
  plan:
  - aggregate:
    - timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number
      put: resource1-staging
    - timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number
      put: resource2-staging
    - timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number
      put: resource3-staging
  - aggregate:
    - timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number
      put: resource4-staging
    - timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number
      put: resource5-staging

- name: resource1-staging
  type: kubernetes
  source:
    <<: *staging-deployment
    resource_name: resource1
...
Bug Report
- Concourse version: 3.8.0
- Deployment type (BOSH/Docker/binary): Docker
- Infrastructure/IaaS: Kubernetes 1.4 on AWS
- Browser (if applicable): n/a
- Did this used to work?: It works when fewer jobs run in parallel.