Skip to content

Conversation

jgwest
Copy link
Member

@jgwest jgwest commented Nov 27, 2024

This PR is a simple test update to fix flaky E2E tests.

A couple of E2E tests (that I originally wrote some time ago) are intermittently failing due to a race condition in acquisition of the ServiceAccount secret.

Both TestArgoCDSupportsMultipleServiceAccountsWithDifferingRBACOnSameCluster and TestDeployToKubernetesAPIURLWithQueryParameter, in deployment_test.go, attempt to acquire a ServiceAccount using GetServiceAccountBearerToken from clusterauth.go. However, within that package, createServiceAccountToken can fail with the following error:

failed to patch serviceaccount "e2e-test-user1-serviceaccount" with bearer token secret: Operation cannot be fulfilled on serviceaccounts "e2e-test-user1-serviceaccount": the object has been modified; please apply your changes to the latest version and try again

As you can probably guess, this is because another process is modifying the resource before our code can complete the patch.

The proposed fix is to retry on failure, up to a maximum of 20 seconds.

I have run these tests over and over for several hours, and did not hit the issue, so as best as I can tell the issue is resolved with this PR.

Example of failing run:

=== RUN   TestDeployToKubernetesAPIURLWithQueryParameter
time="2024-08-29T18:33:20Z" level=info msg="kubectl delete ns -l e2e.argoproj.io=true --field-selector status.phase=Active --wait=false" dir= execID=4bf1a
time="2024-08-29T18:33:20Z" level=debug msg="namespace \"argocd-e2e--test-deployment-without-tracking-mode-jlxqv\" deleted\n" duration=280.930217ms execID=4bf1a
time="2024-08-29T18:33:20Z" level=info msg="kubectl delete crd -l e2e.argoproj.io=true --wait=false" dir= execID=26fa6
time="2024-08-29T18:33:21Z" level=debug msg="No resources found\n" duration=736.63983ms execID=26fa6
time="2024-08-29T18:33:21Z" level=info msg="kubectl delete clusterroles -l e2e.argoproj.io=true --wait=false" dir= execID=a6188
time="2024-08-29T18:33:21Z" level=debug msg="No resources found\n" duration=220.337006ms execID=a6188
time="2024-08-29T18:33:22Z" level=warning msg="Failed to invoke grpc call. Use flag --grpc-web in grpc calls. To avoid this warning message, use flag --grpc-web."
time="2024-08-29T18:33:22Z" level=info msg="../../dist/argocd proj create gpg --server argocd-test-server-argocd-e2e.apps.rosa-qe-415.i1qm.p1.openshiftapps.com --auth-token eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJhcmdvY2QiLCJzdWIiOiJhZG1pbjpsb2dpbiIsImV4cCI6MTcyNTA0MjgwMiwibmJmIjoxNzI0OTU2NDAyLCJpYXQiOjE3MjQ5NTY0MDIsImp0aSI6IjUxNzU4OWFjLTQxMjktNDNkMS1hOTAwLWUxZmQzNWZkNjdlNyJ9.69GMMnp64BvgJZ7ZJaCRP2mqODctv44QyOMa2oGKgpI --insecure" dir= execID=3f5aa
time="2024-08-29T18:33:22Z" level=debug duration=132.070553ms execID=3f5aa
time="2024-08-29T18:33:22Z" level=info msg="mkdir -p /tmp/argo-e2e" dir= execID=effe0
time="2024-08-29T18:33:22Z" level=debug duration=1.970259ms execID=effe0
time="2024-08-29T18:33:22Z" level=info msg="mkdir -p /tmp/argo-e2e/gpg" dir= execID=f1823
time="2024-08-29T18:33:22Z" level=debug duration=1.607923ms execID=f1823
time="2024-08-29T18:33:22Z" level=info msg="chmod 0700 /tmp/argo-e2e/gpg" dir= execID=431be
time="2024-08-29T18:33:22Z" level=debug duration=1.778624ms execID=431be
time="2024-08-29T18:33:22Z" level=info msg="pkill -9 gpg-agent" dir= execID=0e848
time="2024-08-29T18:33:22Z" level=info msg="gpg --import ../fixture/gpg/signingkey.asc" dir= execID=91772
time="2024-08-29T18:33:22Z" level=debug duration=10.727778ms execID=91772
time="2024-08-29T18:33:22Z" level=info msg="cp -Rf testdata /tmp/argo-e2e/testdata.git" dir= execID=bad2d
time="2024-08-29T18:33:23Z" level=debug duration=265.091727ms execID=bad2d
time="2024-08-29T18:33:23Z" level=info msg="chmod 777 ." dir=/tmp/argo-e2e/testdata.git execID=528cb
time="2024-08-29T18:33:23Z" level=debug duration=1.88798ms execID=528cb
time="2024-08-29T18:33:23Z" level=info msg="git init -b master" dir=/tmp/argo-e2e/testdata.git execID=d7dd0
time="2024-08-29T18:33:23Z" level=debug msg="Initialized empty Git repository in /tmp/argo-e2e/testdata.git/.git/\n" duration=3.832245ms execID=d7dd0
time="2024-08-29T18:33:23Z" level=info msg="git add ." dir=/tmp/argo-e2e/testdata.git execID=90e4d
time="2024-08-29T18:33:23Z" level=debug duration=30.579905ms execID=90e4d
time="2024-08-29T18:33:23Z" level=info msg="git commit -q -m initial commit" dir=/tmp/argo-e2e/testdata.git execID=c1d5c
time="2024-08-29T18:33:23Z" level=debug duration=23.928072ms execID=c1d5c
time="2024-08-29T18:33:23Z" level=info msg="git remote add origin http://127.0.0.1:9081/argo-e2e/testdata.git" dir=/tmp/argo-e2e/testdata.git execID=ff654
time="2024-08-29T18:33:23Z" level=debug duration=2.302404ms execID=ff654
time="2024-08-29T18:33:23Z" level=info msg="git push origin master -f" dir=/tmp/argo-e2e/testdata.git execID=86119
time="2024-08-29T18:33:23Z" level=debug duration=137.342519ms execID=86119
time="2024-08-29T18:33:23Z" level=info msg="kubectl create ns argocd-e2e--test-deploy-to-kubernetes-apiurl-with-query-p-xcrix" dir= execID=d2123
time="2024-08-29T18:33:23Z" level=debug msg="namespace/argocd-e2e--test-deploy-to-kubernetes-apiurl-with-query-p-xcrix created\n" duration=140.73248ms execID=d2123
time="2024-08-29T18:33:23Z" level=info msg="kubectl label ns argocd-e2e--test-deploy-to-kubernetes-apiurl-with-query-p-xcrix e2e.argoproj.io=true" dir= execID=4897e
time="2024-08-29T18:33:23Z" level=debug msg="namespace/argocd-e2e--test-deploy-to-kubernetes-apiurl-with-query-p-xcrix labeled\n" duration=311.18895ms execID=4897e
time="2024-08-29T18:33:23Z" level=info msg="clean state" duration=3.781812414s id=TestDeployToKubernetesAPIURLWithQueryParameter-xcrix name=TestDeployToKubernetesAPIURLWithQueryParameter password=password username=admin
time="2024-08-29T18:33:23Z" level=info msg="ServiceAccount \"e2e-test-user1-serviceaccount\" created in namespace \"e2e-test-user1\""
time="2024-08-29T18:33:24Z" level=info msg="Created bearer token secret for ServiceAccount \"e2e-test-user1-serviceaccount\""
    deployment_test.go:313: 
            Error Trace:    /argocd-e2e/argo-cd/test/e2e/deployment_test.go:313
                                        /argocd-e2e/argo-cd/test/e2e/deployment_test.go:137
            Error:          Received unexpected error:
                            failed to patch serviceaccount "e2e-test-user1-serviceaccount" with bearer token secret: Operation cannot be fulfilled on serviceaccounts "e2e-test-user1-serviceaccount": the object has been modified; please apply your changes to the latest version and try again
            Test:           TestDeployToKubernetesAPIURLWithQueryParameter
--- FAIL: TestDeployToKubernetesAPIURLWithQueryParameter (3.99s){noformat}

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Signed-off-by: Jonathan West <jonwest@redhat.com>
@jgwest jgwest requested a review from a team as a code owner November 27, 2024 14:30
Copy link

bunnyshell bot commented Nov 27, 2024

❌ Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

Copy link

codecov bot commented Nov 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 55.04%. Comparing base (7f6340f) to head (5e0ab01).
Report is 414 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #20974      +/-   ##
==========================================
+ Coverage   54.99%   55.04%   +0.04%     
==========================================
  Files         324      324              
  Lines       55466    55466              
==========================================
+ Hits        30505    30530      +25     
+ Misses      22341    22322      -19     
+ Partials     2620     2614       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

return (err == nil && token != ""), nil
})
require.NoError(t, waitErr)
require.NotEmpty(t, token)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what other process can modify it? It might be a sign of some race condition in ArgoCD, which we may need to fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My expectation is that its a core Kubernetes controller that happens to modify the ServiceAccount at the same time as Argo CD. Resource contention on shared resources is pretty common in cases like this. Retrying/re-reconciling on resource update error is a pretty standard controller/operator pattern.

token, err = clusterauth.GetServiceAccountBearerToken(KubeClientset, ns.Name, serviceAccountName, time.Second*60)

// Success is no error and a real token, otherwise keep trying
return (err == nil && token != ""), nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we return err here, not nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning non-nil error here is fatal: it will stop the loop, as per PollUntilContextCancel docs. But, in this case, if err is non-nil, this is only a temporary failure, which is likely to be resolved by subsequent iterations. Thus it is not appropriate to return non-nil error from within the loop.

However, if the loop is not able to complete in the allotted timeframe, the context is canceled, and error is returned from the function, which will be caught and fail the require

@andrii-korotkov-verkada andrii-korotkov-verkada added the ready-for-review An approver should give a final review and merge the PR label Nov 29, 2024
Copy link
Member

@ishitasequeira ishitasequeira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM!! Thanks @jgwest for fixing the flaky tests!!

@ishitasequeira ishitasequeira merged commit 9587ec9 into argoproj:master Dec 3, 2024
35 checks passed
adriananeci pushed a commit to adriananeci/argo-cd that referenced this pull request Dec 4, 2024
…rgoproj#20974)

Signed-off-by: Jonathan West <jonwest@redhat.com>
Signed-off-by: Adrian Aneci <aneci@adobe.com>
revitalbarletz pushed a commit to revitalbarletz/argo-cd that referenced this pull request Jan 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-review An approver should give a final review and merge the PR
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants