-
-
Notifications
You must be signed in to change notification settings - Fork 866
Cache streamed volumes and try to get by looking for local cache #6495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@evanchaoli Sorry reviewing this PR is taking so long, I am new to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feature would be really cool! I also noticed a lot of test failures and I'm assuming that is just because you haven't gotten to fixing the tests with these changes?
No worries. Actually I just finished 2-week vacation for the Chinese New Year holiday, and back to work today. I believe this PR still need many more iterations to go. |
036b3f1
to
213313d
Compare
@clarafu Yes, I haven't fixed the tests yet. Basically this PR changed Concourse behaviors like: 1. With those changes, fixing tests is a big effort. So I want Concourse team to review these changes from design perspective first. Once we aligned with the direction, I'll spend time fixing tests and adding new tests. |
065737b
to
9a47373
Compare
@clarafu So far, this PR is almost ready. Missing part is to add some tests for the new code. I'll add tests in next few days. The changes in |
Just chiming in to respond to this:
Makes sense! We technically need to initialize a pristine cache and then attach a copy of it to the task so it can't modify it. As long as the task's copy-on-write volume expires with the build, leaving only the pristine cache around, this sounds correct. 👍 |
… cache. Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
…" to how this feature helps Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
…bug. Signed-off-by: Chao Li <chaol@vmware.com>
2af5ea3
to
2b1c277
Compare
@clarafu Let me rephrase what is motivation of the PR, so you may notice the problems during your acceptance test. First, let me describe what 7.0.0 behaves. In 7.0.0, @vito changed the image fetch process,
In pre-7.x, if a task selects a worker, the image will just be fetched to the worker. So we can see that, 7.0.0 not only fetches the image but also streaming the image, which is a little bit performance burden. In build-2, In build 3, In build 4, With these 4 builds, Concourse will fetch the same image 3 times, and streaming the image volume 3 times. So we can see, 7.0.0 will take more efforts to warm up docker images over all workers, which is a performance decrease. The initial motivation of this PR is to solve the performance decrease. With this PR, let's see the behavior changes for the 4 builds. Build 1 will behave the same, the image will be fetched to worker-1, and streamed to worker-2. But the change is, the image volume will be marked as a resource cache on worker-2. Build 2: Build 3: Build 4: As worker-2 has the image already, there will be neither image fetching nor streaming. So with this PR, the same 4 builds will do only 1 image fetching and 1 image volume streaming. This feature not only optimize Concourse performance, it will also reduce external communications, like image pulling, git clones:
As |
@clarafu With the above explanation, for performance consideration, my plan is to skip 7.0.0, and use 7.x.0 (hopefully x is 1) where this feature is included. |
@evanchaoli oh silly me, you're totally right I did a yarn build in the wrong repo when I was running your changes Also thanks for the detailed explanation! I do think the performance improvements in this PR are great and would love to get it into the product. I'm just trying to make sure it has as little bugs as possible because it is touching some pretty sensitive parts of the codebase and includes some pretty impactful behavioural changes so I'm sorry if it is taking longer to get merged I just did some simple acceptance and everything seems to be looking good on my end! I think you can work on the finishing touches for the PR and it should be ready to be merged. |
Notice that, |
3af1fe5
to
bcc07de
Compare
@clarafu I have cleaned up the code, removed my private logs and reverted changes in |
@evanchaoli Right there can be multiple workers that you find the resource cache on but there is only one worker that you choose and grab the volume off of, which the volume is then stored into the results of this step. So if we were to display a worker name, it would be that worker. But it is not necessary to add if you don't feel like it is helpful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR and addressing all my feedback! I just ran some simple acceptance tests against it and it looks good 👍
Signed-off-by: Chao Li <chaol@vmware.com>
…eFactory.ResourceCacheMetadata(). Signed-off-by: Chao Li <chaol@vmware.com>
bcc07de
to
4ca8368
Compare
…(), so that UI can see the get step's run time. Signed-off-by: Chao Li <chaol@vmware.com>
Signed-off-by: Chao Li <chaol@vmware.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 👍
I'm not sure the exact reason why, but |
Right, so to summarize for that issue we are seeing for concourse/atc/db/worker_resource_cache.go Lines 23 to 26 in 37c4aeb
it fails because the non-linux worker does not contain any base resource types. |
From our convo on slack, it sounds like the resource cache is associated with a worker_base_resource_type_id is so that we can invalidate all caches if the worker gets bumped. If so then we can probably take the Also, I'm not 100% sure that last error was due to this PR, could just be an unfortunate flake as result of some other error. |
@clarafu Will you follow up this issue? Do you expect me to do something? |
What does this PR accomplish?
Bug Fix | Feature | Documentation
closes #6218 .
This is a performance optimization PR. It will avoid
get
same version duplicately, which is not only reduce resource consumption of Concourse itself, but also reduce Concourse counterpart systems' workloads. E.g. if many jobs need to get a common git repo, then only the first job should dogit clone
, rest of jobs will just get thegit
resource version from local workers, which reducesgit clone
s Concourse bringing to the Git system.I think this could be an advantage that Concourse wins competitors.
I expect this feature plus the p2p stream feature should significantly improve large Concourse cluster's performance.
EDIT: more details is at #6495 (comment)
Changes proposed by this PR:
get
step will try to find resource cache first, if found, then don't runget
container any more.I also studied impacts of this feature to volume-GC. Looks like there is no impact. Because volume GC relies on container GC and resource cache GC. When a container is destroyed, corresponding volumes'
container_id
will be set to NULL, so to resource cache, then once a volumes all xxxx_ids are NULL, it will be GC-ed. As this feature doesn't touch container and resource cache lifecycle at all, volume-GC should not be impacted either.Notes to reviewer:
Please help see if there are any potential issue with this solution.
I tested with the following pipeline:
In the first run,
mygit
didgit clone
:Then in the second run,
git mygit
did nothing but found the resource cache:Release Note
Contributor Checklist
Reviewer Checklist
BOSH and
Helm packaging; otherwise, ignored for
the integration
tests
(for example, if they are Garden configs that are not displayed in the
--help
text).