[Bug] Minor inconsistency in RayJob submitter retries in v1.3

### Search before asking

- [x] I searched the [issues](https://github.com/ray-project/kuberay/issues) and found no similar issues.


### KubeRay Component

ray-operator

### What happened + What you expected to happen

https://github.com/ray-project/kuberay/pull/2579 changed the entrypoint used by the RayJob Job submitter to avoid the "duplicate submission IDs" error on retries. While this issue seems resolved, there is an inconsistency in behavior that could be problematic for some users.

When a Ray job fails, the submitter Job previously made 3 submission attempts:
* Attempt 1: ray job submit -> tail logs -> error when job fails
* Attempt 2: ray job submit -> error due to duplicate submission ID
* Attempt 3: ray job submit -> error due to duplicate submission ID

In v1.3, I observe that the new behavior is:
* Attempt 1: ray job submit -> tail logs -> error when job fails
* Attempt 2: ray job status -> ray job logs -> completed

In Attempt 2, the pod successfully completes with exit code 0 because "ray job logs" returns exit 0 even if the job fails. Here's an example from my cluster:
```
$ kubectl get po
NAME                                                        READY   STATUS      RESTARTS   AGE
pytorch-text-classifier-6q6ff-x59k6                         0/1     Error       0          11m          # first attempt
pytorch-text-classifier-6q6ff-xqhk7                         0/1     Completed   0          10m          # second attempt
```
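
The exit code of the second attempt can be reproduced without a cluster. The sketch below is a simulation, not the real entrypoint: the `ray` function is a hypothetical stub standing in for the CLI, the submission ID `demo` and the script name `train.py` are made up, and `submitter_attempt` is only an assumed shape of the v1.3 entrypoint based on the steps observed above.

```shell
#!/bin/sh
# Hypothetical stub for the `ray` CLI so this runs without a cluster.
# It models a job that was already submitted and has since FAILED.
ray() {
  case "$1 $2" in
    "job status") echo "Job 'demo' failed"; return 0 ;;       # job is known: exit 0 (assumed)
    "job submit") echo "Job 'demo' submitted"; return 0 ;;
    "job logs")   echo "Traceback: job failed"; return 0 ;;   # logs print, exit 0 even for a failed job
  esac
}

# Assumed shape of the submitter entrypoint: submit only when the job is
# unknown, then follow its logs.
submitter_attempt() {
  if ! ray job status demo >/dev/null 2>&1; then
    ray job submit --no-wait --submission-id demo -- python train.py
  fi
  ray job logs --follow demo
}

# Simulate the retry (attempt 2): the job already exists and has failed.
submitter_attempt >/dev/null
attempt_rc=$?
echo "retry attempt exit code: $attempt_rc"
```

Because the last command in the sequence is the log retrieval, the pod's exit code reflects whether the logs could be read, not whether the job succeeded.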

I expected the submitter Job to stop after attempt 1, since the Ray job is already in a terminal state and retrying achieves nothing.

Fortunately, this does not affect the RayJob status because we query the Ray dashboard to get job status.
```
$ kubectl get rayjob
NAME                            JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                                 START TIME             END TIME               AGE
pytorch-text-classifier-6q6ff   FAILED       Failed              pytorch-text-classifier-6q6ff-raycluster-mvg5r   2025-03-20T03:56:14Z   2025-03-20T03:57:31Z   13m
```

However, the misleading part is that the Kubernetes Job status is marked as successful, where previously it was marked as failed:
```
$ kubectl get job
NAME                            STATUS     COMPLETIONS   DURATION   AGE
pytorch-text-classifier-6q6ff   Complete   1/1           66s        13m
```

To be clear, the new behavior is still a huge improvement because the submitter no longer fails due to duplicate submission IDs. It is also arguably cleaner: submitter Job completion now only indicates that the job was successfully submitted, not that the job itself was successful. However, this is a behavior change from v1.2 that I think we should call out in case other users report it. Anyone relying on the submitter Job status for any reason (which they probably shouldn't) could run into issues.
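
For anyone who currently gates on the submitter Job, a minimal sketch of gating on the RayJob status instead. The helper name `rayjob_exit_code` is hypothetical, and the `.status.jobStatus` field path in the comment is my assumption about where the JOB STATUS column comes from; please verify it against your CRD version.

```shell
#!/bin/sh
# Hypothetical helper: map a RayJob JOB STATUS value to an exit code.
rayjob_exit_code() {
  case "$1" in
    SUCCEEDED)      return 0 ;;
    FAILED|STOPPED) return 1 ;;
    *)              return 2 ;;   # pending, running, empty, or unknown
  esac
}

# On a real cluster the value could be read with something like
# (field path is an assumption):
#   status=$(kubectl get rayjob pytorch-text-classifier-6q6ff -o jsonpath='{.status.jobStatus}')
status="FAILED"   # value from the `kubectl get rayjob` output above

rayjob_exit_code "$status"
check_rc=$?
echo "exit code for $status: $check_rc"
```

This keeps a CI step or script failing when the Ray job fails, regardless of how the submitter pod exits.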

### Reproduction script

1) Create a RayJob whose entrypoint is going to fail
2) Set spec.submitterConfig.backoffLimit = 2
3) Observe that there is only 1 retry, and that the retry pod completes with exit code 0
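
A minimal sketch of a manifest for the steps above. The RayJob name `rayjob-always-fails`, the entrypoint, and the Ray image tag are placeholders I picked for illustration, not values from my cluster; adjust them to your environment. The `kubectl` commands are left commented out so the snippet only writes the file.

```shell
#!/bin/sh
# Write a RayJob manifest whose entrypoint always exits non-zero.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-always-fails        # placeholder name
spec:
  entrypoint: 'python -c "raise SystemExit(1)"'   # placeholder failing entrypoint
  submitterConfig:
    backoffLimit: 2                # step 2 of the reproduction
  rayClusterSpec:
    rayVersion: '2.41.0'           # assumed version, match your image
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.41.0   # assumed image tag
EOF
echo "manifest written to $manifest"

# Then, against a cluster with KubeRay v1.3 installed:
#   kubectl apply -f "$manifest"
#   kubectl get po     # expect one Error pod and one Completed pod
#   kubectl get job    # expect the submitter Job to show Complete
```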

### Anything else

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

[Bug] Minor inconsistency in RayJob submitter retries in v1.3 #3211