
Conversation

evanchaoli
Contributor

@evanchaoli evanchaoli commented Feb 3, 2021

What does this PR accomplish?

Bug Fix | Feature | Documentation

Closes #6218.

This is a performance optimization PR. It avoids fetching the same resource version repeatedly, which not only reduces Concourse's own resource consumption but also reduces the workload on the systems Concourse talks to. E.g. if many jobs need to get a common git repo, only the first job has to do a git clone; the rest of the jobs just get the git resource version from local workers, which reduces the number of git clones Concourse sends to the Git server.

I think this could be a competitive advantage for Concourse.

I expect this feature, plus the p2p streaming feature, to significantly improve the performance of large Concourse clusters.

EDIT: more details are at #6495 (comment)

Changes proposed by this PR:

  1. Mark streamed volumes as resource caches.
  2. The get step will first look for an existing resource cache; if one is found, it no longer runs a get container at all.
  3. Change how remote inputs are streamed in. Originally, when an input was on a remote worker, Concourse would create a new volume and stream the remote volume's contents into it; that new volume could then be written to by a task. This PR changes the flow to: stream in the remote volume and create a COW volume on top of it, so that local and remote inputs are handled the same way. @vito, this is a big change, so I'd like your comment as well.
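The new flow in item 3 can be sketched as follows. This is a toy model with hypothetical types (`Volume`, `streamInAsCache`), not the real Concourse worker API:

```go
package main

import "fmt"

// Volume is an illustrative stand-in for a worker volume.
type Volume struct {
	Handle string
	Parent *Volume // non-nil for a copy-on-write volume
	Cache  bool    // marked as a resource cache
}

// streamInAsCache models the new flow for a remote input.
func streamInAsCache(remoteHandle string) (cache, cow *Volume) {
	// 1. Stream the remote volume's contents into a fresh local volume
	//    and mark it as a resource cache.
	cache = &Volume{Handle: remoteHandle + "-cache", Cache: true}
	// 2. Give the container a COW volume on top of that cache, exactly
	//    as local inputs are handled, so task writes never dirty it.
	cow = &Volume{Handle: remoteHandle + "-cow", Parent: cache}
	return cache, cow
}

func main() {
	cache, cow := streamInAsCache("abc123")
	fmt.Println(cache.Cache, cow.Parent == cache) // true true
}
```

The key property is that the pristine cache and the task-writable COW child are separate volumes, so the cache can be reused by later builds.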

I also studied the impact of this feature on volume GC, and it looks like there is none. Volume GC relies on container GC and resource cache GC: when a container is destroyed, the corresponding volumes' container_id is set to NULL, and likewise for resource caches; once all of a volume's xxxx_ids are NULL, it gets GC-ed. Since this feature doesn't touch container or resource cache lifecycles at all, volume GC should not be affected either.
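The GC rule described here can be illustrated with a small sketch. The struct and column names below are illustrative stand-ins for the volumes table, not the actual schema:

```go
package main

import "fmt"

// volumeRow models the ownership references on a volume; each pointer is
// nil once the corresponding owner (container, resource cache, ...) is gone.
type volumeRow struct {
	ContainerID     *int // cleared when the owning container is destroyed
	ResourceCacheID *int // cleared when the resource cache is invalidated
}

// gcEligible: a volume may be GC-ed only once every reference is NULL.
func gcEligible(v volumeRow) bool {
	return v.ContainerID == nil && v.ResourceCacheID == nil
}

func main() {
	cacheID := 7
	fmt.Println(gcEligible(volumeRow{ResourceCacheID: &cacheID})) // false: still owned
	fmt.Println(gcEligible(volumeRow{}))                          // true: orphaned
}
```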

Notes to reviewer:

Please help check whether there are any potential issues with this solution.

I tested with the following pipeline:

resources:
- name: mygit
  type: git
  source:
    uri: <a git repo>
    branch: master
    private_key: ((chaol-gitlab-private-key))

jobs:
- name: test
  public: true
  plan:
  - get: mygit
  - in_parallel:
    - task: t1
      config:
        platform: linux
        image_resource:
          type: registry-image
          source: {repository: busybox}
        inputs:
        - name: mygit
        outputs:
        - name: mygit
        run:
          path: sh
          args:
           - -exc
           - |
             echo ok
             sleep 5
             ls mygit/
             date > mygit/test.1.`date +%m-%d-%H-%M-%S`
    - task: t2
      config:
        platform: linux
        image_resource:
          type: registry-image
          source: {repository: busybox}
        inputs:
        - name: mygit
        outputs:
        - name: mygit
        run:
          path: sh
          args:
           - -exc
           - |
             echo ok
             sleep 5
             ls mygit/
             date > mygit/test.2.`date +%m-%d-%H-%M-%S`
    - task: t3
      config:
        platform: linux
        image_resource:
          type: registry-image
          source: {repository: busybox}
        inputs:
        - name: mygit
        outputs:
        - name: mygit
        run:
          path: sh
          args:
           - -exc
           - |
             echo ok
             sleep 5
             ls mygit/
             date > mygit/test.3.`date +%m-%d-%H-%M-%S`
    - task: t4
      config:
        platform: linux
        image_resource:
          type: registry-image
          source: {repository: busybox}
        inputs:
        - name: mygit
        outputs:
        - name: mygit
        run:
          path: sh
          args:
           - -exc
           - |
             echo ok
             sleep 5
             ls mygit/
             date > mygit/test.4.`date +%m-%d-%H-%M-%S`
  - task: show-result
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: busybox}
      inputs:
      - name: mygit
      run:
        path: sh
        args:
         - -exc
         - |
           ls mygit/

In the first run, the mygit get did a git clone:

(screenshot: first run)

Then in the second run, the mygit get did nothing but find the resource cache:

(screenshot: second run)

Release Note

Contributor Checklist

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • Documentation reviewed
  • Release notes reviewed
  • PR acceptance performed
  • New config flags added? Ensure that they are added to the BOSH and Helm packaging; otherwise, they are ignored for the integration tests (for example, if they are Garden configs that are not displayed in the --help text).

@clarafu
Contributor

clarafu commented Feb 11, 2021

@evanchaoli Sorry reviewing this PR is taking so long, I am new to the worker package and how all the code is intertwined so it is taking a bit of time to understand it! Just wanted to let you know that I am looking at it and it's not being ignored!

Contributor

@clarafu clarafu left a comment


This feature would be really cool! I also noticed a lot of test failures and I'm assuming that is just because you haven't gotten to fixing the tests with these changes?

@evanchaoli
Contributor Author

@evanchaoli Sorry reviewing this PR is taking so long, I am new to the worker package and how all the code is intertwined so it is taking a bit of time to understand it! Just wanted to let you know that I am looking at it and it's not being ignored!

No worries. Actually, I just finished a 2-week vacation for the Chinese New Year holiday and am back to work today. I believe this PR still needs many more iterations.

@evanchaoli evanchaoli force-pushed the cache-streamed-2 branch 2 times, most recently from 036b3f1 to 213313d Compare February 22, 2021 09:27
@evanchaoli evanchaoli requested a review from clarafu February 22, 2021 12:34
@evanchaoli
Contributor Author

evanchaoli commented Feb 23, 2021

I also noticed a lot of test failures and I'm assuming that is just because you haven't gotten to fixing the tests with these changes?

@clarafu Yes, I haven't fixed the tests yet. Basically, this PR changes Concourse behavior in two ways:

1. get does nothing if a locally cached volume can be found.
2. The way remote volumes are streamed in when creating a container has changed.

With those changes, fixing the tests is a big effort, so I'd like the Concourse team to review these changes from a design perspective first. Once we're aligned on the direction, I'll spend time fixing the existing tests and adding new ones.

@evanchaoli evanchaoli force-pushed the cache-streamed-2 branch 2 times, most recently from 065737b to 9a47373 Compare February 25, 2021 07:56
@evanchaoli
Contributor Author

@clarafu So far, this PR is almost ready. The missing part is tests for the new code; I'll add them in the next few days.

The changes in docker-compose.yml, as well as the logs marked with "EVAN", will be cleaned up before merge. Please just ignore them when you review the code.

@vito
Member

vito commented Feb 26, 2021

Just chiming in to respond to this:

Change how remote inputs are streamed in. Originally, when an input was on a remote worker, Concourse would create a new volume and stream the remote volume's contents into it; that new volume could then be written to by a task. This PR changes the flow to: stream in the remote volume and create a COW volume on top of it, so that local and remote inputs are handled the same way. @vito, this is a big change, so I'd like your comment as well.

Makes sense! We technically need to initialize a pristine cache and then attach a copy of it to the task so it can't modify it. As long as the task's copy-on-write volume expires with the build, leaving only the pristine cache around, this sounds correct. 👍

… cache.

Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
…" to how this feature helps

Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
…bug.

Signed-off-by: Chao Li <chaol@vmware.com>
@evanchaoli
Contributor Author

evanchaoli commented Mar 5, 2021

@clarafu Let me rephrase the motivation of this PR, so you can watch for the relevant problems during your acceptance testing.

First, let me describe how 7.0.0 behaves. In 7.0.0, @vito changed the image fetch process: task/put steps (get is the same if it doesn't use a builtin resource type) embed check and get steps to fetch the docker image that the task/put needs. I'll use a task as the example:

build-1

get: image selected worker-1, so it fetched the docker image to worker-1.
The task itself selected worker-2, so Concourse streamed the image volume from worker-1 to worker-2.

In pre-7.x, if a task selected a worker, the image would simply be fetched to that worker. So 7.0.0 not only fetches the image but also streams it, which is a bit of a performance burden.

build-2

In build 2, get: image selected worker-3, so it fetched the image again, this time to worker-3.
The task again selected worker-2. Ideally, given that worker-2 already ran build 1 and therefore has the image, there would be no need to stream the image from worker-3. But because streamed volumes are not marked as resource caches, 7.0.0 streams the image volume from worker-3 to worker-2.

build-3

In build 3, get: image again selected worker-3. As worker-3 already fetched the image in build 2, it does not fetch it again.
The task again selected worker-2. Unfortunately, 7.0.0 streams the image volume from worker-3 to worker-2 once more.

build-4

In build 4, get: image selected worker-2. Though the image volume has been streamed to worker-2 multiple times, it still fetches the image again.
The task also selected worker-2, and now that the image has been fetched on worker-2, no streaming is needed.

Across these 4 builds, Concourse fetches the same image 3 times and streams the image volume 3 times. So 7.0.0 takes more effort to warm up docker images across all workers, which is a performance regression.

The initial motivation of this PR is to fix that regression. With this PR, here is how the same 4 builds behave.

Build 1 behaves the same: the image is fetched to worker-1 and streamed to worker-2. The change is that the streamed image volume is now marked as a resource cache on worker-2.

Build 2: get: image finds the image on worker-1 and worker-2, so the get does nothing. And as worker-2 already contains the image, no streaming is needed.

Build 3: get: image again finds the image on worker-1 and worker-2, so the get does nothing. Again, worker-2 already contains the image, so there is no streaming.

Build 4: as worker-2 already has the image, there is neither image fetching nor streaming.

So with this PR, the same 4 builds perform only 1 image fetch and 1 image volume stream.

This feature not only optimizes Concourse's own performance, it also reduces external traffic such as image pulls and git clones:

  • Say a cluster has N workers. Without this feature, a docker image may be pulled N times (plus an unpredictable number of volume streams) in order to warm up all workers. With this feature, the image is pulled only once and streamed N-1 times to warm up all workers.
  • Say a common git repo is fetched by many pipelines; every build of those pipelines does a git clone. With this feature, only 1 git clone is needed, and Concourse uses the local cache for later builds.

And because get does nothing when a resource cache can be found, many get containers are saved, which also improves worker performance.
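The 4-build walkthrough above can be replayed as a toy simulation. This is a sketch of the scheduling behavior described in the comment, not Concourse code; `cacheStreamed` toggles this PR's behavior (streamed-in volumes count as resource caches, and a get does nothing when any worker already holds a cache):

```go
package main

import "fmt"

// simulate replays the four builds and counts image fetches and volume
// streams, with and without caching of streamed volumes.
func simulate(cacheStreamed bool) (fetches, streams int) {
	type build struct{ getWorker, taskWorker string }
	builds := []build{
		{"worker-1", "worker-2"}, // build 1
		{"worker-3", "worker-2"}, // build 2
		{"worker-3", "worker-2"}, // build 3
		{"worker-2", "worker-2"}, // build 4
	}
	cached := map[string]bool{} // workers holding a usable image cache
	for _, b := range builds {
		if cacheStreamed {
			// With the PR, the get does nothing if a cache exists anywhere.
			if len(cached) == 0 {
				fetches++
				cached[b.getWorker] = true
			}
		} else if !cached[b.getWorker] {
			// Without the PR, only a previous fetch on this exact worker
			// counts; streamed copies do not.
			fetches++
			cached[b.getWorker] = true
		}
		if !cached[b.taskWorker] {
			streams++
			if cacheStreamed {
				cached[b.taskWorker] = true // streamed volume becomes a cache
			}
		}
	}
	return fetches, streams
}

func main() {
	f, s := simulate(false)
	fmt.Printf("7.0.0:   %d fetches, %d streams\n", f, s) // 3 fetches, 3 streams
	f, s = simulate(true)
	fmt.Printf("this PR: %d fetches, %d streams\n", f, s) // 1 fetch, 1 stream
}
```

Running it reproduces the counts from the walkthrough: 3 fetches and 3 streams on 7.0.0 versus 1 and 1 with this PR.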

@evanchaoli
Contributor Author

@clarafu With the above explanation in mind, for performance reasons my plan is to skip 7.0.0 and move to 7.x.0 (hopefully x is 1), where this feature is included.

@evanchaoli evanchaoli requested a review from clarafu March 5, 2021 03:39
@clarafu
Contributor

clarafu commented Mar 8, 2021

@evanchaoli oh silly me, you're totally right I did a yarn build in the wrong repo when I was running your changes

Also, thanks for the detailed explanation! I do think the performance improvements in this PR are great and would love to get them into the product. I'm just trying to make sure it has as few bugs as possible, because it touches some pretty sensitive parts of the codebase and includes some pretty impactful behavioural changes, so I'm sorry it is taking longer to get merged.

I just did some simple acceptance testing and everything seems to be looking good on my end! I think you can work on the finishing touches for the PR and it should be ready to merge.

@clarafu
Contributor

clarafu commented Mar 8, 2021

Just an idea, not strictly necessary: it might be nice to show which worker the get step grabbed the local resource cache from. So it would also say selected worker: <worker-with-cache> in the build logs.


@evanchaoli
Contributor Author

evanchaoli commented Mar 9, 2021

Just an idea, not strictly necessary: it might be nice to show which worker the get step grabbed the local resource cache from. So it would also say selected worker: <worker-with-cache> in the build logs.

Notice that the get step doesn't actually grab a resource at all as long as it finds a local cache on some worker. There may be multiple workers containing the local cache, and the following step may grab the cached resource from any of them. So I don't think it makes much sense to show a worker name from which the resource could be grabbed.

@evanchaoli
Contributor Author

@clarafu I have cleaned up the code, removed my private logs, and reverted the changes in docker-compose.yml. Unless you have new comments, I think this PR is ready to merge.

@clarafu
Contributor

clarafu commented Mar 9, 2021

@evanchaoli Right, there can be multiple workers on which you find the resource cache, but there is only one worker whose volume you actually choose and grab, and that volume is then stored in the results of this step.

https://github.com/concourse/concourse/pull/6495/files#diff-b549ffbbf97e509ea25727b6638ba7327f12b5704909f27a1b17f5b89c364c2cR345

So if we were to display a worker name, it would be that worker's. But it's not necessary to add if you don't feel it would be helpful.

Contributor

@clarafu clarafu left a comment


Thanks for the PR and addressing all my feedback! I just ran some simple acceptance tests against it and it looks good 👍

Signed-off-by: Chao Li <chaol@vmware.com>
…eFactory.ResourceCacheMetadata().

Signed-off-by: Chao Li <chaol@vmware.com>
…(), so that UI can see the get step's run time.

Signed-off-by: Chao Li <chaol@vmware.com>
@evanchaoli evanchaoli requested a review from clarafu March 10, 2021 06:12
Signed-off-by: Chao Li <chaol@vmware.com>
Contributor

@clarafu clarafu left a comment


👍 👍

@chenbh
Contributor

chenbh commented Mar 11, 2021

@evanchaoli @clarafu

I'm not sure of the exact reason why, but WorkerResourceCaches must be associated with a WorkerBaseResourceType. This wasn't a problem before, since resource caches were only initialized during a get step (which can only run on linux workers with at least 1 base resource type). But now that just streaming in a volume can initialize a resource cache, Windows and Darwin workers are trying to register resource caches and erroring out.

@clarafu
Contributor

clarafu commented Mar 11, 2021

Right, so to summarize: the error find or create container on worker <non-linux-worker>: worker base resource type disappeared happens because Concourse tries to InitializeResourceCache on non-linux workers, but when it comes time to find the worker base resource type on the worker

usedWorkerBaseResourceType, found, err := WorkerBaseResourceType{
	Name:       baseResourceType.Name,
	WorkerName: workerResourceCache.WorkerName,
}.Find(tx)

it fails, because the non-linux worker does not contain any base resource types.

@clarafu
Contributor

clarafu commented Mar 11, 2021

There were three errors we saw on our ci deployment. I took some screenshots of them


@chenbh
Contributor

chenbh commented Mar 11, 2021

From our convo on Slack, it sounds like the reason a resource cache is associated with a worker_base_resource_type_id is so that we can invalidate all caches when the worker's base resource type gets bumped. If so, we can probably take the worker_base_resource_type_id from the source resource cache to populate the destination cache.

Also, I'm not 100% sure that the last error was due to this PR; it could just be an unfortunate flake resulting from some other error.
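The fix suggested above (inheriting the association from the source cache rather than resolving it on the destination worker) could look roughly like this. The types and function names are illustrative, not the real atc/db package:

```go
package main

import (
	"errors"
	"fmt"
)

// workerResourceCache is an illustrative stand-in for the DB record; the
// base resource type pointer is nil on workers (e.g. Windows/Darwin)
// that register no base resource types.
type workerResourceCache struct {
	WorkerName               string
	WorkerBaseResourceTypeID *int
}

// initializeStreamedCache creates the destination worker's cache record
// by copying worker_base_resource_type_id from the source cache, instead
// of looking it up on the destination (which fails on non-linux workers).
func initializeStreamedCache(src workerResourceCache, destWorker string) (workerResourceCache, error) {
	if src.WorkerBaseResourceTypeID == nil {
		return workerResourceCache{}, errors.New("source cache missing base resource type")
	}
	return workerResourceCache{
		WorkerName:               destWorker,
		WorkerBaseResourceTypeID: src.WorkerBaseResourceTypeID, // inherited, not looked up
	}, nil
}

func main() {
	id := 42
	dst, err := initializeStreamedCache(
		workerResourceCache{WorkerName: "linux-worker", WorkerBaseResourceTypeID: &id},
		"windows-worker",
	)
	fmt.Println(err, dst.WorkerName, *dst.WorkerBaseResourceTypeID)
}
```

This keeps the invalidation property intact: if the linux worker's base resource type is bumped, every cache sharing that id, including streamed copies on non-linux workers, is invalidated together.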

@evanchaoli
Contributor Author

@clarafu Will you follow up on this issue? Do you expect me to do anything?

@clarafu clarafu added release/no-impact This is an issue that never affected released versions (i.e. a regression caught prior to shipping). misc and removed enhancement labels Mar 15, 2021
@clarafu clarafu changed the title feat(atc): cache streamed volumes and try to get by looking for local cache Cache streamed volumes and try to get by looking for local cache Mar 15, 2021
Labels
misc release/no-impact This is an issue that never affected released versions (i.e. a regression caught prior to shipping).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

task image get and task are on different workers
4 participants