Docker image jobs take a long time to schedule #1966

@avanier

  • Concourse version: 3.8.0
  • Deployment type (BOSH/Docker/binary): Binary
  • Infrastructure/IaaS: GCE

Our setup consists of a minimum of 4 nodes in an autoscaling group. Nodes are configured as n1-standard-4 (4 vCPUs, 15 GiB RAM) with an attached 375 GiB NVMe disk. The OS is Ubuntu 16.04 LTS.

The following settings are of interest:

  • CONCOURSE_WORK_DIR is redirected to NVMe storage
  • CONCOURSE_BAGGAGECLAIM_DRIVER is set to overlay
  • CONCOURSE_BAGGAGECLAIM_VOLUMES is redirected to NVMe storage
  • CONCOURSE_BAGGAGECLAIM_OVERLAYS_DIR is redirected to NVMe storage
  • CONCOURSE_GARDEN_DESTROY_CONTAINERS_ON_STARTUP=true
  • CONCOURSE_GARDEN_DEPOT is redirected to NVMe storage
  • CONCOURSE_GARDEN_PROPERTIES_PATH is redirected to NVMe storage
  • CONCOURSE_GARDEN_ASSETS_DIR is redirected to NVMe storage
  • The NVMe is formatted with mkfs.xfs -f -n ftype=1 /dev/nvme0n1p1 and mounted with noatime, which is in line with the Google storage recommendations for GCE
  • The NVMe partition is labeled GPT and is block aligned
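
As a rough sketch, the settings above could be expressed as environment variables like so. The variable names are the ones listed; the mount point /mnt/nvme and every subdirectory under it are placeholders, since the actual paths are not given here.

```shell
# Hypothetical mount point for the NVMe partition (assumption, not from the issue).
NVME_MOUNT="${NVME_MOUNT:-/mnt/nvme}"

# Worker scratch space on NVMe.
export CONCOURSE_WORK_DIR="$NVME_MOUNT/concourse/work"

# Baggageclaim: overlay driver, volumes and overlays on NVMe.
export CONCOURSE_BAGGAGECLAIM_DRIVER=overlay
export CONCOURSE_BAGGAGECLAIM_VOLUMES="$NVME_MOUNT/concourse/volumes"
export CONCOURSE_BAGGAGECLAIM_OVERLAYS_DIR="$NVME_MOUNT/concourse/overlays"

# Garden: clean up containers on startup, keep all state on NVMe.
export CONCOURSE_GARDEN_DESTROY_CONTAINERS_ON_STARTUP=true
export CONCOURSE_GARDEN_DEPOT="$NVME_MOUNT/garden/depot"
export CONCOURSE_GARDEN_PROPERTIES_PATH="$NVME_MOUNT/garden/properties.json"
export CONCOURSE_GARDEN_ASSETS_DIR="$NVME_MOUNT/garden/assets"
```

The point of this layout is that every path Concourse, baggageclaim, and Garden write to sits on the same NVMe-backed filesystem.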

We tested the NVMe on these instances and get performance in the ballpark of 330 MiB/s and 18k random IOPS. We would expect this to be sufficient to pull large images.

With the settings above, certain tasks that rely on the Docker Image Resource, and whose images we know contain a lot of small files, take an average of 15 minutes to be scheduled. The behaviour seems consistent with what was observed in #1404, where a build will "hang" for a while in the loading state before announcing it's running.

[Screenshot: unit-assets-web__4_-_concourse]

We tried reverting to btrfs only to enjoy stalled workers and garden variety mayhem. (...see what I did there? 😛 )

This issue seems to follow in the wake of #1230, for which we had a meeting with @vito, @jama-pivotal, and @topherbullock. (Thanks again for that, guys! It's much appreciated!) I'm not expecting a quick fix, but I do want to point out that this is a quantifiable problem for the organizations where we're running Concourse. I am also volunteering whatever resources you believe might help alleviate the problem.

As one would say... Halp!

cc @Typositoire @fkoclas @baptiste-bonnaudet
