[RayJob] add RayJob pass Deadline e2e-test with retry #2241
Conversation
test.Expect(err).NotTo(HaveOccurred())
test.T().Logf("Created RayJob %s/%s successfully", rayJob.Namespace, rayJob.Name)

// The RayJob will transition to `Complete` because it has passed `ActiveDeadlineSeconds`.
This comment should say `Failed`, not `Complete`.
// regardless of the value of backoffLimit.
rayJobAC := rayv1ac.RayJob("long-running", namespace.Name).
    WithSpec(rayv1ac.RayJobSpec().
        WithBackoffLimit(2).
This test should actually check that the backoffLimit behavior is not used once the active deadline has passed. You can verify this by checking that the status.succeeded and status.failed fields remain at 0.
Thanks for your comment. I have a question about status.failed: why should it be 0? After the job times out, it should be marked as failed. Also, since it will not be allowed to retry, only one job would fail. Additionally, when I check the job status, the test output indicates status.failed should be 1.
test.T().Logf("Waiting for RayJob %s/%s to be 'Failed'", rayJob.Namespace, rayJob.Name)
test.Eventually(RayJob(test, rayJob.Namespace, rayJob.Name), TestTimeoutShort).
Should(WithTransform(RayJobDeploymentStatus, Equal(rayv1.JobDeploymentStatusFailed)))
test.Expect(GetRayJob(test, rayJob.Namespace, rayJob.Name)).
To(WithTransform(RayJobReason, Equal(rayv1.DeadlineExceeded)))
test.Expect(GetRayJob(test, rayJob.Namespace, rayJob.Name)).
Should(WithTransform(RayJobFailed, Equal(int32(0))))
test.Expect(GetRayJob(test, rayJob.Namespace, rayJob.Name)).
Should(WithTransform(RayJobSucceeded, Equal(int32(0))))
rayjob_retry_test.go:157:
Expected
<int32>: 1
to equal
<int32>: 0
=== NAME TestRayJobRetry/RayJob_has_passed_ActiveDeadlineSeconds
testing.go:1576: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
I think the expected result is succeeded: 0 and failed: 1. When the RayJob passes the deadline, KubeRay will set the JobDeploymentStatus to JobDeploymentStatusFailed, and status.failed will increment by 1 whenever JobDeploymentStatus is JobDeploymentStatusFailed.
Thank you for the additional information. Here is the code for the reference:
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 383 to 385 in cc94c6a
if rayJob.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusFailed {
    failedCount++
}
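The controller logic quoted above can be illustrated as a standalone, runnable sketch. The `JobDeploymentStatus` type and the `countFailed` helper below are simplified stand-ins for illustration, not part of the KubeRay codebase:

```go
package main

import "fmt"

// JobDeploymentStatus mirrors the KubeRay status values discussed in this thread.
type JobDeploymentStatus string

const (
	JobDeploymentStatusComplete JobDeploymentStatus = "Complete"
	JobDeploymentStatusFailed   JobDeploymentStatus = "Failed"
)

// countFailed is a hypothetical helper mirroring the controller snippet above:
// failedCount increments once for each observed Failed deployment status.
func countFailed(statuses []JobDeploymentStatus) int32 {
	var failedCount int32
	for _, s := range statuses {
		if s == JobDeploymentStatusFailed {
			failedCount++
		}
	}
	return failedCount
}

func main() {
	// A RayJob that exceeds ActiveDeadlineSeconds fails once and is not
	// retried, so exactly one Failed status is observed: failed == 1.
	fmt.Println(countFailed([]JobDeploymentStatus{JobDeploymentStatusFailed})) // 1
}
```

This matches the conclusion reached above: after the deadline passes, the expected counts are succeeded: 0 and failed: 1.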
Hi, I have added a status check to ensure that
Nice
Why are these changes needed?
Add test for rayjob_retry_test.go.
kuberay/ray-operator/test/e2e/rayjob_retry_test.go
Lines 83 to 87 in ea314d7
Make sure the RayJob status reaches JobDeploymentStatusFailed after ActiveDeadlineSeconds, regardless of the number of retries.
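For context, a minimal RayJob manifest exercising this scenario might look like the sketch below (field values are illustrative; only backoffLimit and activeDeadlineSeconds matter for this test, and the entrypoint is a hypothetical long-running command):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: long-running
spec:
  backoffLimit: 2            # retries that must be ignored once the deadline passes
  activeDeadlineSeconds: 30  # RayJob should fail with reason DeadlineExceeded after this
  entrypoint: python -c "import time; time.sleep(300)"
```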
Related issue number
Checks