chore: Fix to intermittent E2E test failures in deployment_test.go #20974
Conversation
Signed-off-by: Jonathan West <jonwest@redhat.com>
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##           master   #20974      +/-   ##
==========================================
+ Coverage   54.99%   55.04%   +0.04%
==========================================
  Files         324      324
  Lines       55466    55466
==========================================
+ Hits        30505    30530      +25
+ Misses      22341    22322      -19
+ Partials     2620     2614       -6

☔ View full report in Codecov by Sentry.
return (err == nil && token != ""), nil
})
require.NoError(t, waitErr)
require.NotEmpty(t, token)
Do you know what other process can modify it? It might be a sign of some race condition in ArgoCD, which we may need to fix.
My expectation is that it's a core Kubernetes controller that happens to modify the ServiceAccount at the same time as Argo CD. Resource contention on shared resources is pretty common in cases like this. Retrying/re-reconciling on a resource-update error is a pretty standard controller/operator pattern.
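(Not part of this PR, just for illustration: a minimal sketch of that retry-on-conflict pattern on the controller side, using client-go's `retry.RetryOnConflict`. The function name and the field being patched here are hypothetical.)

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// addTokenSecretRef updates a ServiceAccount and, if the API server answers
// "the object has been modified", re-reads the latest version and tries again.
func addTokenSecretRef(ctx context.Context, clientset kubernetes.Interface, ns, name, secretName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always fetch the freshest copy before mutating, so the update is
		// based on the current resourceVersion.
		sa, err := clientset.CoreV1().ServiceAccounts(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		sa.Secrets = append(sa.Secrets, corev1.ObjectReference{Name: secretName})
		_, err = clientset.CoreV1().ServiceAccounts(ns).Update(ctx, sa, metav1.UpdateOptions{})
		return err // a Conflict here makes RetryOnConflict run the closure again
	})
}
```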
token, err = clusterauth.GetServiceAccountBearerToken(KubeClientset, ns.Name, serviceAccountName, time.Second*60)

// Success is no error and a real token, otherwise keep trying
return (err == nil && token != ""), nil
Shall we return err here, not nil?
Returning a non-nil error here would be fatal: it stops the loop, per the `PollUntilContextCancel` docs. In this case, though, a non-nil `err` is only a temporary failure that is likely to be resolved by a subsequent iteration, so it is not appropriate to return a non-nil error from within the loop. However, if the loop cannot complete in the allotted timeframe, the context is canceled and an error is returned from the function, which is caught and fails the `require.NoError` check.
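(For illustration only, a self-contained sketch of those `PollUntilContextCancel` semantics with a generic condition; the helper name and the 1-second interval are made up, not taken from this PR.)

```go
package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// pollUntilReady retries a flaky check until it succeeds or ctx expires.
// Returning (false, nil) from the condition means "not done yet, poll again";
// returning a non-nil error is fatal and stops the loop immediately; if the
// condition never reports done, the poll ends when ctx is canceled or times
// out, and that context error is what the caller receives.
func pollUntilReady(ctx context.Context, check func() error) error {
	return wait.PollUntilContextCancel(ctx, time.Second, true, func(ctx context.Context) (bool, error) {
		if err := check(); err != nil {
			return false, nil // transient failure: try again on the next tick
		}
		return true, nil // done: stop polling, caller sees a nil error
	})
}
```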
Changes LGTM!! Thanks @jgwest for fixing the flaky tests!!
…rgoproj#20974) Signed-off-by: Jonathan West <jonwest@redhat.com> Signed-off-by: Adrian Aneci <aneci@adobe.com>
…rgoproj#20974) Signed-off-by: Jonathan West <jonwest@redhat.com>
This PR is a simple test update to fix flaky E2E tests.
A couple of E2E tests (that I originally wrote some time ago) are intermittently failing due to a race condition in acquisition of the `ServiceAccount` secret.

Both `TestArgoCDSupportsMultipleServiceAccountsWithDifferingRBACOnSameCluster` and `TestDeployToKubernetesAPIURLWithQueryParameter`, in `deployment_test.go`, attempt to acquire a `ServiceAccount` using `GetServiceAccountBearerToken` from `clusterauth.go`. However, within that package, `createServiceAccountToken` can fail with the following error:

failed to patch serviceaccount "e2e-test-user1-serviceaccount" with bearer token secret: Operation cannot be fulfilled on serviceaccounts "e2e-test-user1-serviceaccount": the object has been modified; please apply your changes to the latest version and try again
As you can probably guess, this is because another process is modifying the resource before our code can complete the patch.
The proposed fix is to retry on failure, up to a maximum of 20 seconds.
I have run these tests over and over for several hours and did not hit the issue, so as best I can tell, this PR resolves it.
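(A hedged sketch that assembles the diff fragments quoted in the review thread above into one place; the `clusterauth` import path, the 1-second poll interval, and the helper's shape are assumptions, so the merged test code may differ in detail.)

```go
package e2e

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"

	"github.com/argoproj/argo-cd/v2/util/clusterauth"
)

// getBearerTokenWithRetry keeps calling GetServiceAccountBearerToken until it
// yields a non-empty token, giving up after roughly 20 seconds.
func getBearerTokenWithRetry(t *testing.T, kubeClientset kubernetes.Interface, namespace, serviceAccountName string) string {
	t.Helper()

	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()

	var token string
	waitErr := wait.PollUntilContextCancel(ctx, time.Second, true, func(ctx context.Context) (bool, error) {
		var err error
		token, err = clusterauth.GetServiceAccountBearerToken(kubeClientset, namespace, serviceAccountName, time.Second*60)

		// Success is no error and a real token, otherwise keep trying
		return err == nil && token != "", nil
	})
	require.NoError(t, waitErr)
	require.NotEmpty(t, token)

	return token
}
```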
Example of failing run:
Checklist: