Skip to content

[RayJob] add RayJob pass Deadline e2e-test with retry #2241

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 15, 2024

Conversation

karta1502545
Copy link
Contributor

Why are these changes needed?

Add test for rayjob_retry_test.go.

// test.T().Run("RayJob has passed ActiveDeadlineSeconds", func(_ *testing.T) {
// TODO: Add a test case to verify that the RayJob has passed ActiveDeadlineSeconds
// and ensure that the RayJob transitions to JobDeploymentStatusFailed
// regardless of the value of backoffLimit. Refer to rayjob_test.go for an example.
// })

Make sure RayJob status reaches to JobDeploymentStatusFailed after ActiveDeadlineSeconds regardless of retry times.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
$ kind delete cluster

$ sudo rm -rf bin

$ cd ray-operator

$ kind create cluster --image=kindest/node:v1.24.0

$ IMG=kuberay/operator:nightly make docker-build

$ kind load docker-image kuberay/operator:nightly

$ helm install kuberay-operator --set image.repository=kuberay/operator --set image.tag=nightly ../helm-chart/kuberay-operator

$ go test -timeout 30m -v ./test/e2e/rayjob_retry_test.go ./test/e2e/support.go 2>&1 | tee e2e-rayjob-test.log
--- PASS: TestRayJobRetry (175.48s)
    --- PASS: TestRayJobRetry/Failing_RayJob_without_cluster_shutdown_after_finished (90.70s)
    --- PASS: TestRayJobRetry/Failing_submitter_K8s_Job (77.66s)
    --- PASS: TestRayJobRetry/RayJob_has_passed_ActiveDeadlineSeconds (7.05s)
PASS
ok      command-line-arguments  176.472s

test.Expect(err).NotTo(HaveOccurred())
test.T().Logf("Created RayJob %s/%s successfully", rayJob.Namespace, rayJob.Name)

// The RayJob will transition to `Complete` because it has passed `ActiveDeadlineSeconds`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should say Failed not Complete

// regardless of the value of backoffLimit.
rayJobAC := rayv1ac.RayJob("long-running", namespace.Name).
WithSpec(rayv1ac.RayJobSpec().
WithBackoffLimit(2).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test should actually check that the backoffLimit behavior is not used when active deadline passed. You can check this by checking the status.succeeded and status.failed fields remain at 0

Copy link
Contributor Author

@karta1502545 karta1502545 Jul 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment. I have a question about status.failed

Why is status.failed 0? After the job times out, it should be marked as failed. Also, since it will not be allowed to retry, only 1 job would fail. Additionally, when I check the job status, the test output indicates status.failed should be 1.

test.T().Logf("Waiting for RayJob %s/%s to be 'Failed'", rayJob.Namespace, rayJob.Name)

test.Eventually(RayJob(test, rayJob.Namespace, rayJob.Name), TestTimeoutShort).
	Should(WithTransform(RayJobDeploymentStatus, Equal(rayv1.JobDeploymentStatusFailed)))
test.Expect(GetRayJob(test, rayJob.Namespace, rayJob.Name)).
	To(WithTransform(RayJobReason, Equal(rayv1.DeadlineExceeded)))

test.Expect(GetRayJob(test, rayJob.Namespace, rayJob.Name)).
	Should(WithTransform(RayJobFailed, Equal(int32(0))))
test.Expect(GetRayJob(test, rayJob.Namespace, rayJob.Name)).
	Should(WithTransform(RayJobSucceeded, Equal(int32(0))))
    rayjob_retry_test.go:157: 
        Expected
            <int32>: 1
        to equal
            <int32>: 0
=== NAME  TestRayJobRetry/RayJob_has_passed_ActiveDeadlineSeconds
    testing.go:1576: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the expected result is succeeded: 0 and failed: 1. When the RayJob passes the deadline, KubeRay will set the JobDeploymentStatus to JobDeploymentStatusFailed. status.failed will increment by 1 if JobDeploymentStatus is JobDeploymentStatusFailed.

Copy link
Contributor Author

@karta1502545 karta1502545 Jul 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the additional information. Here is the code for the reference:

if rayJob.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusFailed {
failedCount++
}

@kevin85421 kevin85421 self-assigned this Jul 12, 2024
@karta1502545
Copy link
Contributor Author

Hi, I have added a status check to ensure that status.failed is 1 and status.succeeded is 0 after a RayJob timeout.
Thank you very much for your help!

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@kevin85421 kevin85421 merged commit 1efaf68 into ray-project:master Jul 15, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants