Inconsistency with executing 25+ tasks for concourse-kubernetes-resource in parallel #1907

@dharmesh-periscope

Description

Hi there!
We are having trouble scaling a job that uses the concourse-kubernetes-resource. This is a third-party resource, so I understand it may be out of scope for the Concourse team, but if you have seen similar issues in the past, any insights you can provide would be very valuable for debugging. We have made some minor changes to the resource, but it mainly runs the kubectl set image command.
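For context, this is roughly how the resource type is wired into the pipeline; the repository name below is a placeholder for our fork, not the actual image:

resource_types:
- name: kubernetes
  type: docker-image
  source:
    # placeholder: our lightly modified fork of the third-party resource
    repository: our-fork/concourse-kubernetes-resource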

Scenario:
Our use case requires running 25+ tasks in aggregate. Timing for a successful run varies from under 2 minutes to over 30 minutes. Otherwise we see the following inconsistent behavior:

  • The tasks get stuck midway through execution and don't recover. They end up with an "interrupted" message, which we believe generally indicates a timeout.
  • It is unclear when the timeout window starts, since each task has 5 retries, and the retries often start and end immediately.
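For what it's worth, my reading of the step-modifier docs (which may be wrong, hence this question) is that when timeout and attempts are combined, the timeout applies to each attempt individually, so a single put like ours could occupy a worker for up to attempts × timeout before finally failing:

# Sketch of one of our put steps; if the timeout is per attempt,
# the worst case for this step alone is 5 × 3m = 15m.
- put: resource1-staging
  timeout: 3m
  attempts: 5
  params:
    image_name: our-image/repository
    image_tag: version/number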

Debugging:
We have tried the following things so far:

  • Experimented with various timeouts: 1m/2m/3m.
  • Tried executing all tasks in serial, but it takes a very long time: 30 minutes or more.
  • Tried executing tasks in batches of 3/5/7, but we get the same inconsistent results.
  • Restarting the workers clears things up, but within 2 or 3 runs the original behavior returns.

Can you suggest a way to debug this issue? Your help is greatly appreciated. Thank you!
Dharmesh

Example configuration:

- name: jobs-staging
  serial: true
  plan:
  - aggregate:
    - put: resource1-staging
      timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number
    - put: resource2-staging
      timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number
    - put: resource3-staging
      timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number
  - aggregate:
    - put: resource4-staging
      timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number
    - put: resource5-staging
      timeout: 3m
      attempts: 5
      params:
        image_name: our-image/repository
        image_tag: version/number

- name: resource1-staging
  type: kubernetes
  source:
    <<: *staging-deployment
    resource_name: resource1
...

Bug Report

  • Concourse version: 3.8.0
  • Deployment type (BOSH/Docker/binary): Docker
  • Infrastructure/IaaS: Kubernetes 1.4 on AWS
  • Browser (if applicable): n/a
  • Did this used to work?: It works for a smaller number of jobs in parallel.
