-
Notifications
You must be signed in to change notification settings - Fork 6.3k
Description
Summary
Background:
In our current setup, some of our kustomize-based apps take a long time to generate the manifests.
E.g. our heaviest app kustomize build
takes 30s locally and up to 1m when executed in reposerver.
We're okay with being slow to deploy new changes (because we use heavy KRM functions with kustomize).
Problem:
However, rollbacks via the UI also take a long time (at least 2x longer), which is a problem for us.
This is strange because no new information is involved in rolling back to a previously experienced revision, so I'd presume reposerver should cache hit everything from redis.
Upon investigating, it turns out the main bit of rollback itself does indeed hit cache.
But two other steps involved in rollback via UI each lead to cache miss and trigger two full re-generation of manifests in reposerver, making the whole rollback process slow.
Investigation:
Process of a rollback via UI can be summarized as:
- User clicks and opens "HISTORY AND ROLLBACK".
- User browses and chooses the revision from the history list to rollback to.
Two requests are made for each browsed revision:GET revision metadata
: gives info about revision commit author and commit messagePOST appdetails
: returns info including list of images
- User clicks "Rollback" and accepts to disable auto-sync in order to proceed
In this process, couple of requests are made which include:PUT application
: to disable auto-syncPOST rollback
: to initiate the rollback itself
Cache miss (and thus regeneration of manifests) occurs in POST appdetails
and PUT application
.
-
PUT application
: Cache miss occurs at verifyGenerateManifests(), because the code always uses the latest commit of the repo as cache key, and commits stack up very frequently in a typical gitops monorepo setup:Lines 814 to 816 in a12f517
// Only check whether we can access the application's path, // and not whether it actually contains any manifests. _, err = repoClient.GenerateManifest(ctx, &req) -
POST appdetails
: Cache miss occurs because appdetails info incorporates data from generated manifests, but doesn't reuse the cache already set before by GenerateManifest().
Motivation
I think code can be improved so that the two cache misses above be avoided.
This would allow us to respond much more quickly to a faulty deployment by being able to rollback more quickly.
Proposal
First, cache miss in PUT application
occurs because the cache key changes too frequently.
But I think it can be avoided by just specifying ?validate=false
as query params to the PUT request at the frontend, thus skipping the call to verifyGenerateManifests() entirely.
There's no need to verify/validate when all we're doing is disabling auto-sync. This behavior would be consistent with the fact that the frontend already adds ?validate=false
to PUT requests triggered via the "ENABLE AUTO-SYNC" and "DISABLE AUTO-SYNC" buttons in the application details page.
Next, for the case of POST appdetails
, I think code in populateKustomizeAppDetails() can be improved to get manifest data from redis cache instead of calling Build().
But I do have to say I'm not too confident on my understanding of this part of the code, so I'm hoping someone could cross-check this or suggest a better option maybe.
Thanks!! 🤗