[Bug] Minor inconsistency in RayJob submitter retries in v1.3 #3211

@andrewsykim

Description

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

#2579 changed the entrypoint used by the RayJob Job submitter to avoid the "duplicate submission IDs" error on retries. While this issue seems resolved, there is an inconsistency in behavior that could be problematic for some users.

When a Ray job fails, the previous behavior was that the submitter Job made 3 attempts at job submission:

  • Attempt 1: ray job submit -> tail logs -> error when job fails
  • Attempt 2: submit job -> error due to duplicate submission ID
  • Attempt 3: submit job -> error due to duplicate submission ID

In v1.3, I observe that the new behavior is:

  • Attempt 1: submit job -> tail logs -> error when job fails
  • Attempt 2: ray job status -> ray job logs -> completed

In Attempt 2, the pod completes with exit code 0 because "ray job logs" exits with code 0 even if the job failed. Here's an example from my cluster:

$ kubectl get po
NAME                                                        READY   STATUS      RESTARTS   AGE
pytorch-text-classifier-6q6ff-x59k6                         0/1     Error       0          11m          # first attempt
pytorch-text-classifier-6q6ff-xqhk7                         0/1     Completed   0          10m          # second attempt
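
To make the difference between the two retry sequences concrete, here is a small toy model in Python. It is not KubeRay code: the function names and the `"v1.2"` / `"v1.3"` labels are made up for illustration, and the exit codes simply mirror the attempts described above (a failed first attempt, then either a duplicate submission ID error or a "ray job logs" that exits 0).

```python
from typing import List, Optional


def submitter_exit_code(existing_status: Optional[str], final_status: str, version: str) -> int:
    """Model the exit code of a single submitter pod attempt.

    existing_status: status of the submission ID before this attempt,
                     or None if the job has not been submitted yet.
    final_status:    terminal status the Ray job reaches ("SUCCEEDED" or "FAILED").
    version:         hypothetical label, "v1.2" or "v1.3".
    """
    if existing_status is None:
        # First attempt in both versions: submit, tail logs, fail if the job fails.
        return 0 if final_status == "SUCCEEDED" else 1
    if version == "v1.2":
        # Old behavior: resubmitting fails on the duplicate submission ID.
        return 1
    # v1.3 behavior as observed: status check, then logs, which exits 0.
    return 0


def run_submitter_job(final_status: str, version: str, backoff_limit: int = 2) -> List[int]:
    """Return the exit code of every attempt the Kubernetes Job would make."""
    exit_codes = []
    existing_status = None
    for _ in range(backoff_limit + 1):
        code = submitter_exit_code(existing_status, final_status, version)
        exit_codes.append(code)
        if code == 0:
            break  # a successful pod completes the Job, so no more retries
        existing_status = final_status  # the job now exists in a terminal state
    return exit_codes


print(run_submitter_job("FAILED", "v1.2"))  # three failed attempts
print(run_submitter_job("FAILED", "v1.3"))  # one failure, then a "successful" retry
```

With a failing job and a backoff limit of 2, the model yields `[1, 1, 1]` for the old behavior and `[1, 0]` for the new one, which matches the two pods shown above.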

I expected the submitter Job to stop after attempt 1, since the Ray job is already in a terminal state and retrying does not achieve anything.

Fortunately, this does not affect the RayJob status because we query the Ray dashboard to get job status.

$ kubectl get rayjob
NAME                            JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                                 START TIME             END TIME               AGE
pytorch-text-classifier-6q6ff   FAILED       Failed              pytorch-text-classifier-6q6ff-raycluster-mvg5r   2025-03-20T03:56:14Z   2025-03-20T03:57:31Z   13m

However, the misleading part is that the Kubernetes Job status is marked as successful, where previously it was marked as failed:

$ kubectl get job
NAME                            STATUS     COMPLETIONS   DURATION   AGE
pytorch-text-classifier-6q6ff   Complete   1/1           66s        13m

To be clear, the new behavior is still a huge improvement because we no longer fail due to duplicate submission IDs. It is also reasonable in its own right: submitter Job completion only indicates that the job was successfully submitted, not that the job itself succeeded. However, this is still a behavior change from v1.2 that I felt we should call out in case other users report it. If a user was relying on the submitter Job status for any reason (which they probably shouldn't), they could run into issues.
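
For anyone who does inspect both objects, here is a minimal sketch of how to detect the mismatch from their JSON (for example, the output of `kubectl get ... -o json` loaded into a dict). The helper names are hypothetical. I am assuming the RayJob exposes the value shown in the JOB STATUS column as `status.jobStatus`, and that the Kubernetes Job reports completion through a `Complete` condition in `status.conditions`.

```python
def submitter_job_completed(job: dict) -> bool:
    """True if the Kubernetes Job has a Complete condition set to "True"."""
    conditions = job.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Complete" and c.get("status") == "True"
        for c in conditions
    )


def ray_job_failed(rayjob: dict) -> bool:
    """True if the RayJob reports the Ray job as FAILED (assumed field name)."""
    return rayjob.get("status", {}).get("jobStatus") == "FAILED"


def statuses_disagree(rayjob: dict, job: dict) -> bool:
    """True when the submitter Job looks successful but the Ray job failed."""
    return submitter_job_completed(job) and ray_job_failed(rayjob)


# Trimmed-down objects mirroring the kubectl output above.
rayjob = {"status": {"jobStatus": "FAILED", "jobDeploymentStatus": "Failed"}}
job = {"status": {"succeeded": 1, "conditions": [{"type": "Complete", "status": "True"}]}}
print(statuses_disagree(rayjob, job))
```

The point of the sketch is that the RayJob status is the one to trust for the job outcome; the submitter Job status only says whether a submitter pod eventually exited 0.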

Reproduction script

  1. Create a RayJob that is going to fail
  2. Use spec.submitterConfig.backoffLimit = 2
  3. Observe there's only 1 retry
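
As a starting point for step 1, here is a rough sketch that builds a RayJob manifest whose entrypoint always fails and prints it as JSON, which `kubectl apply -f -` accepts. Treat it as an untested assumption rather than a known-good manifest: the function name, the RayJob name, the container name and the image tag are placeholders, and the cluster spec is trimmed to what I believe is the minimum. Only `spec.submitterConfig.backoffLimit` comes from the steps above.

```python
import json


def failing_rayjob_manifest(name: str, backoff_limit: int = 2) -> dict:
    """Build a RayJob whose entrypoint exits non-zero (hypothetical helper)."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            # Entrypoint that fails immediately so the Ray job ends up FAILED.
            "entrypoint": 'python -c "import sys; sys.exit(1)"',
            # Step 2 of the reproduction: allow up to 2 submitter retries.
            "submitterConfig": {"backoffLimit": backoff_limit},
            # Assumed minimal cluster spec; the image tag is a placeholder.
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "ray-head", "image": "rayproject/ray:2.41.0"}
                            ]
                        }
                    },
                }
            },
        },
    }


print(json.dumps(failing_rayjob_manifest("failing-rayjob"), indent=2))
```

Piping the output into `kubectl apply -f -` and then running `kubectl get po` should show whether the submitter stops after one retry, as in step 3.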

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
