Large Dynamic Applications resulting in stale resource state #8175

@sidewinder12s

Description

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

I've got a set of very large Kubernetes clusters, but a fairly low number of Argo CD Applications.
However, some of those Applications track DaemonSets (e.g. prometheus-operator) where we can easily have 1000s of resources tracked within a single app, and those resources scale extremely dynamically (adding or removing 100s of resources at a time).

These larger applications constantly become stale/out of sync with the cluster: when I open the app, it shows old pods that no longer exist in the cluster. The only way I've found to clear this is to restart the argocd-application-controller pods. Performing a hard refresh on the app does not clear these old resources either.

Are there any performance recommendations for when you have very large single applications that are also very dynamic? The HA docs don't go into much detail about what each option actually affects; the overall gist is that you need to tune them when you have a large number of applications, with nothing really speaking to a small number of very large applications.
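For context, the main knob I'm aware of beyond the controller flags (shown further down) is timeout.reconciliation in argocd-cm, which controls how often apps are refreshed when no Git change event arrives. A minimal sketch, assuming the default argocd namespace and an illustrative 300s value:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Periodic app refresh interval (default 180s). Raising it reduces
  # reconciliation pressure on large apps at the cost of slower
  # detection of Git changes; 300s here is just an example.
  timeout.reconciliation: 300s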

I haven't really found any error messages from the Application Controller that stand out either, beyond the one in the Logs section below, which shows up fairly often.

To Reproduce

Stand up an application with 1000s of tracked resources and introduce a lot of churn.
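As a concrete illustration (not our exact manifests), an Application like the one below on a cluster with hundreds of autoscaled nodes produces this kind of churn, since kube-prometheus-stack ships DaemonSets that add or remove a pod per node; the name, chart version, and destination namespace are all illustrative:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-operator
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 30.0.1  # illustrative chart version
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true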

Expected behavior

The Application Controller does not hold on to stale resources

Version

argocd: v2.1.5+a8a6fc8
  BuildDate: 2021-10-20T15:16:40Z
  GitCommit: a8a6fc8dda0e26bb1e0b893e270c1128038f5b0f
  GitTreeState: clean
  GoVersion: go1.16.5
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.1.5+a8a6fc8
  BuildDate: 2021-10-20T15:16:40Z
  GitCommit: a8a6fc8dda0e26bb1e0b893e270c1128038f5b0f
  GitTreeState: clean
  GoVersion: go1.16.5
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v4.2.0 2021-06-30T22:49:26Z
  Helm Version: v3.6.0+g7f2df64
  Kubectl Version: v0.21.0
  Jsonnet Version: v0.17.0

We're also following most of the HA recommendations, including many of the monorepo-specific ones (using the annotations and webhook, and multiple repo-server pods). Further, we have plenty of headroom on resource usage for all components.
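For reference, the monorepo-related pieces in place are the Git webhook plus the manifest-generate-paths annotation on each Application, roughly like this (app name and path are illustrative; spec omitted):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-operator
  namespace: argocd
  annotations:
    # Webhook-triggered refreshes only fire for commits that touch the
    # listed paths; "." is relative to the Application's spec.source.path.
    argocd.argoproj.io/manifest-generate-paths: .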

argocd-application-controller:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  # Replicas and ARGOCD_CONTROLLER_REPLICAS env var need to match
  replicas: 8
  template:
    spec:
      containers:
      - name: argocd-application-controller
        command:
        - argocd-application-controller
        # Number of application status reconciliation processors
        - --status-processors
        - "200"
        # Number of application sync operation processors
        - --operation-processors
        - "100"
        # Timeout (seconds) for controller requests to the repo-server
        - --repo-server-timeout-seconds
        - "180"
        # Point the controller at the HAProxy in front of Redis HA
        - --redis
        - "argocd-redis-ha-haproxy:6379"
        env:
          - name: ARGOCD_CONTROLLER_REPLICAS
            value: '8'
        resources:
          requests:
            cpu: 4
            memory: 6Gi

Logs

Failed to get cached managed resources for tree reconciliation, fall back to full reconciliation

Labels

breaking/high (a possibly breaking change with high impact), bug (something isn't working)
