Skip to content

Rollback via UI is slow where it doesn't have to be? #22898

@dlemfh

Description

@dlemfh

Summary

Background:
In our current setup, some of our kustomize-based apps take a long time to generate the manifests.
E.g. our heaviest app kustomize build takes 30s locally and up to 1m when executed in reposerver.

We're okay with being slow to deploy new changes (because we use heavy KRM functions with kustomize).

Problem:
However, rollbacks via the UI also take a long time (at least 2x longer), which is a problem for us.

This is strange because no new information is involved in rolling back to a previously experienced revision, so I'd presume reposerver should cache hit everything from redis.

Upon investigating, it turns out the main bit of rollback itself does indeed hit cache.
But two other steps involved in rollback via UI each lead to cache miss and trigger two full re-generation of manifests in reposerver, making the whole rollback process slow.

Investigation:
Process of a rollback via UI can be summarized as:

  1. User clicks and opens "HISTORY AND ROLLBACK".
  2. User browses and chooses the revision from the history list to rollback to.
    Two requests are made for each browsed revision:
    • GET revision metadata: gives info about revision commit author and commit message
    • POST appdetails: returns info including list of images
  3. User clicks "Rollback" and accepts to disable auto-sync in order to proceed
    In this process, couple of requests are made which include:
    • PUT application: to disable auto-sync
    • POST rollback: to initiate the rollback itself

Cache miss (and thus regeneration of manifests) occurs in POST appdetails and PUT application.

  • PUT application: Cache miss occurs at verifyGenerateManifests(), because the code always uses the latest commit of the repo as cache key, and commits stack up very frequently in a typical gitops monorepo setup:

    argo-cd/util/argo/argo.go

    Lines 814 to 816 in a12f517

    // Only check whether we can access the application's path,
    // and not whether it actually contains any manifests.
    _, err = repoClient.GenerateManifest(ctx, &req)

  • POST appdetails: Cache miss occurs because appdetails info incorporates data from generated manifests, but doesn't reuse the cache already set before by GenerateManifest().

Motivation

I think code can be improved so that the two cache misses above be avoided.
This would allow us to respond much more quickly to a faulty deployment by being able to rollback more quickly.

Proposal

First, cache miss in PUT application occurs because the cache key changes too frequently.

But I think it can be avoided by just specifying ?validate=false as query params to the PUT request at the frontend, thus skipping the call to verifyGenerateManifests() entirely.

There's no need to verify/validate when all we're doing is disabling auto-sync. This behavior would be consistent with the fact that the frontend already adds ?validate=false to PUT requests triggered via the "ENABLE AUTO-SYNC" and "DISABLE AUTO-SYNC" buttons in the application details page.

Next, for the case of POST appdetails, I think code in populateKustomizeAppDetails() can be improved to get manifest data from redis cache instead of calling Build().

But I do have to say I'm not too confident on my understanding of this part of the code, so I'm hoping someone could cross-check this or suggest a better option maybe.

Thanks!! 🤗

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions