Description
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
#2579 changed the entrypoint used by the RayJob Job submitter to avoid the "duplicate submission IDs" error on retries. While this issue seems resolved, there is an inconsistency in behavior that could be problematic for some users.
When a Ray job failed, the previous behavior was that the submitter Job made 3 attempts at job submission:
- Attempt 1: ray job submit -> tail logs -> error when job fails
- Attempt 2: submit job -> error due to duplicate submission ID
- Attempt 3: submit job -> error due to duplicate submission ID
In v1.3, I observe that the new behavior is:
- Attempt 1: submit job -> tail logs -> error when job fails
- Attempt 2: ray job status -> ray job logs -> completed
In Attempt 2, the pod completes successfully with exit code 0, because "ray job logs" exits with code 0 even if the job failed. Here's an example from my cluster:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
pytorch-text-classifier-6q6ff-x59k6 0/1 Error 0 11m # first attempt
pytorch-text-classifier-6q6ff-xqhk7 0/1 Completed 0 10m # second attempt
I expected the submitter Job to stop after attempt 1, since the Ray job has already reached a terminal state and retrying does not achieve anything.
Fortunately, this does not affect the RayJob status, because we query the Ray dashboard to get the job status.
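The two attempt sequences above can be sketched as a toy model. This is not KubeRay or Kubernetes code: `run_submitter_job`, `v1_2_attempt`, and `v1_3_attempt` are hypothetical names, and the model only assumes that a Job retries a failing pod until the number of failures exceeds backoffLimit and stops at the first pod that exits 0.

```python
from typing import Callable, List, Tuple


def run_submitter_job(
    attempt_exit_code: Callable[[int], int], backoff_limit: int
) -> Tuple[str, List[int]]:
    """Toy model of a single-completion Kubernetes Job.

    Runs attempts until one exits 0 (Job is Complete) or the number of
    failed attempts exceeds backoff_limit (Job is Failed).
    """
    exit_codes: List[int] = []
    failures = 0
    attempt = 1
    while True:
        code = attempt_exit_code(attempt)
        exit_codes.append(code)
        if code == 0:
            return "Complete", exit_codes
        failures += 1
        if failures > backoff_limit:
            return "Failed", exit_codes
        attempt += 1


def v1_2_attempt(attempt: int) -> int:
    # Attempt 1 fails because the Ray job fails; later attempts fail
    # with the duplicate submission ID error. Every attempt is non-zero.
    return 1


def v1_3_attempt(attempt: int) -> int:
    # Attempt 1 fails because the Ray job fails. Attempt 2 only runs
    # "ray job status" and "ray job logs", which exits 0.
    return 1 if attempt == 1 else 0


old_status, old_codes = run_submitter_job(v1_2_attempt, backoff_limit=2)
new_status, new_codes = run_submitter_job(v1_3_attempt, backoff_limit=2)
print(old_status, old_codes)
print(new_status, new_codes)
```

Under these assumptions the old entrypoint uses all three attempts and the Job ends as Failed, while the new entrypoint stops after two attempts and the Job ends as Complete.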
$ kubectl get rayjob
NAME JOB STATUS DEPLOYMENT STATUS RAY CLUSTER NAME START TIME END TIME AGE
pytorch-text-classifier-6q6ff FAILED Failed pytorch-text-classifier-6q6ff-raycluster-mvg5r 2025-03-20T03:56:14Z 2025-03-20T03:57:31Z 13m
However, the misleading part is that the Kubernetes Job status is marked as successful, whereas previously it was marked as failed:
$ kubectl get job
NAME STATUS COMPLETIONS DURATION AGE
pytorch-text-classifier-6q6ff Complete 1/1 66s 13m
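For anyone scripting against these resources, here is a minimal sketch of deriving the outcome from the RayJob object rather than from the submitter Job. It assumes the JOB STATUS and DEPLOYMENT STATUS columns above come from the `status.jobStatus` and `status.jobDeploymentStatus` fields of the RayJob resource, and that the input is the parsed output of something like `kubectl get rayjob <name> -o json`. The function name `rayjob_outcome` and the returned labels are hypothetical.

```python
import json


def rayjob_outcome(rayjob: dict) -> str:
    """Classify a RayJob by its own status, ignoring the submitter Job.

    Assumes status.jobStatus carries the Ray job status string
    (for example SUCCEEDED, FAILED, or STOPPED once the job is terminal).
    """
    job_status = rayjob.get("status", {}).get("jobStatus", "")
    if job_status == "SUCCEEDED":
        return "succeeded"
    if job_status in {"FAILED", "STOPPED"}:
        return "failed"
    # Anything else (missing, PENDING, RUNNING, ...) is treated as not finished.
    return "in-progress"


# Sample mirroring the kubectl output in this issue (only the fields used here).
sample = json.loads(
    """
    {
      "kind": "RayJob",
      "metadata": {"name": "pytorch-text-classifier-6q6ff"},
      "status": {"jobStatus": "FAILED", "jobDeploymentStatus": "Failed"}
    }
    """
)
print(rayjob_outcome(sample))
```

With this approach the result is "failed" for the RayJob in this issue, even though the submitter Job reports Complete.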
To be clear, the new behavior is still a huge improvement because we no longer fail due to duplicate submission IDs. It also has a reasonable interpretation: submitter Job completion only indicates that the job was successfully submitted, not that the job itself was successful. However, this is still a behavior change from v1.2 that I felt we should call out in case other users report it. If a user was relying on the submitter Job status for any reason (which they probably shouldn't), they could run into issues.
Reproduction script
- Create a RayJob whose entrypoint is going to fail
- Set spec.submitterConfig.backoffLimit = 2
- Observe that there is only 1 retry and that the submitter Job is marked Complete
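The steps above could be set up with a manifest along these lines. This is a sketch, not the exact RayJob used in this issue: the name `failing-rayjob`, the failing entrypoint, and the image tag are assumptions for illustration, and the cluster spec is kept to the minimum I believe the CRD accepts. The script prints JSON, which can be piped to `kubectl apply -f -`.

```python
import json


def build_failing_rayjob(name: str, image: str, backoff_limit: int) -> dict:
    """Build a RayJob manifest whose entrypoint always exits non-zero."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical entrypoint: fails immediately with exit code 1.
            "entrypoint": 'python -c "import sys; sys.exit(1)"',
            "submitterConfig": {"backoffLimit": backoff_limit},
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "ray-head", "image": image}
                            ]
                        }
                    },
                }
            },
        },
    }


manifest = build_failing_rayjob(
    name="failing-rayjob",          # assumed name, not from the issue
    image="rayproject/ray:2.41.0",  # assumed image tag, adjust as needed
    backoff_limit=2,
)
print(json.dumps(manifest, indent=2))
```

After applying it, `kubectl get po`, `kubectl get rayjob`, and `kubectl get job` should show the same pattern as in the description above.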
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!