
Conversation

rueian (Contributor) commented Nov 27, 2024

Why are these changes needed?

Currently, as noted in issue #2537, when a user creates a RayJob CR, KubeRay uses the same image as the RayCluster to start another container that submits the Ray job. However, if that container lands on a node where the image is not preloaded, it takes a long time to download the image and start, since the image is usually large.

This PR adds a lightweight submitter (45 MB) that mimics the ray job submit behavior (submit + tail logs) to the KubeRay image, which is usually much smaller than the image used in the RayCluster. Users can try it with the submitterPodTemplate in their RayJob CR.

Example RayJob CR YAML:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  rayClusterSpec:
    ...
  submitterPodTemplate:
    spec:
      restartPolicy: Never
      containers:
        - name: my-custom-rayjob-submitter-pod
          image: kuberay/submitter:nightly
          args: ["--runtime-env-json", '{"pip":["requests==2.26.0","pendulum==2.1.2"],"env_vars":{"counter_name":"test_counter"}}', "--", "python", "/home/ray/samples/sample_code.py"]

Also, this submitter will not fail when the job has already been submitted, so it will also solve #2154.
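
For readers unfamiliar with the flow, below is a minimal Go sketch of what such a submitter does against the Ray Jobs HTTP API exposed by the dashboard (POST /api/jobs/ to submit, GET /api/jobs/<submission_id>/logs to read logs). The address, submission ID, and entrypoint are placeholders, and this is not the code in this PR; error handling is condensed for brevity.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// submitJob posts the job to the Ray dashboard. Treating an "already exists"
// style response as success is what makes re-running the submitter safe.
func submitJob(address, submissionID, entrypoint string) error {
	payload, err := json.Marshal(map[string]string{
		"submission_id": submissionID,
		"entrypoint":    entrypoint,
	})
	if err != nil {
		return err
	}
	resp, err := http.Post(address+"/api/jobs/", "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer func() { _ = resp.Body.Close() }()
	if resp.StatusCode >= 200 && resp.StatusCode < 300 {
		return nil
	}
	// Assumption: a non-2xx status here most likely means the job was already
	// submitted, so fall through to tailing logs instead of failing.
	fmt.Fprintf(os.Stderr, "job %s may already exist (status %d); tailing logs\n", submissionID, resp.StatusCode)
	return nil
}

// tailLogs fetches the accumulated logs once; a real submitter would poll
// until the job reaches a terminal state.
func tailLogs(address, submissionID string) error {
	resp, err := http.Get(address + "/api/jobs/" + submissionID + "/logs")
	if err != nil {
		return err
	}
	defer func() { _ = resp.Body.Close() }()
	var out struct {
		Logs string `json:"logs"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return err
	}
	fmt.Print(out.Logs)
	return nil
}

func main() {
	address := "http://rayjob-sample-head-svc:8265" // placeholder head service address
	if err := submitJob(address, "rayjob-sample", "python /home/ray/samples/sample_code.py"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := tailLogs(address, "rayjob-sample"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}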

Related issue number

#2537

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

rueian force-pushed the light-weight-job-submitter branch 20 times, most recently from 6a0092f to 9483127 on November 30, 2024 17:38
rueian marked this pull request as ready for review on December 1, 2024 04:40
andrewsykim (Member) commented Dec 3, 2024

Are we planning to merge this PR and #2579?

I would suggest holding off on this until we hear feedback about #2579 and whether it addresses the concerns with RayJob.

rueian (Contributor, Author) commented Dec 3, 2024

Just to note that although both PRs can solve the duplicate submission issue, this lightweight submitter can further shorten the startup duration thanks to its smaller image.

andrewsykim (Member):
Makes sense, but I'm concerned about the KubeRay operator image becoming a dependency at the cluster/job level. If we think this is worth doing, we should probably create a new image.

rueian force-pushed the light-weight-job-submitter branch 2 times, most recently from 2bdb064 to d274346 on December 10, 2024 22:46
rueian force-pushed the light-weight-job-submitter branch 2 times, most recently from 66302f3 to d274346 on December 10, 2024 23:47
rueian force-pushed the light-weight-job-submitter branch from d274346 to e3fb564 on December 10, 2024 23:54
rueian (Contributor, Author) commented Dec 10, 2024

Hi @kevin85421, I have added a new GitHub Actions job to build a dedicated image for the submitter, but the job requires credentials which I believe are only available after the PR is merged.


Do you have any suggestions on how to test the GitHub Actions job before merging the PR? Or should we just merge it first?

andrewsykim (Member):
IMO I don't think we need this with #2579 merged. Or at least we can revisit after v1.3 based on user feedback

kevin85421 (Member):
> IMO I don't think we need this with #2579 merged. Or at least we can revisit after v1.3 based on user feedback

The lightweight job submitter still has its own benefits (e.g., much faster image pulling), but I agree that we can revisit this based on the feedback from v1.3 to determine if the image pulling overhead of the K8s Job Submitter is problematic. If users always run the submitter on a K8s node that caches the Ray image, the lightweight submitter may not be necessary.

ronaldo-valente-sgpiu:
> IMO I don't think we need this with #2579 merged. Or at least we can revisit after v1.3 based on user feedback
>
> The lightweight job submitter still has its own benefits (e.g., much faster image pulling), but I agree that we can revisit this based on the feedback from v1.3 to determine if the image pulling overhead of the K8s Job Submitter is problematic. If users always run the submitter on a K8s node that caches the Ray image, the lightweight submitter may not be necessary.

@kevin85421 any news about this feature? It would be very useful in an AWS Fargate environment, because images are always pulled whenever a new Ray pod is created.

dushulin (Contributor) commented Jul 14, 2025

@kevin85421 We have encountered similar problems recently, and they trigger the exponential backoff retry of controller-runtime. Because the submitter image takes a long time to pull, GetRayjobInfo keeps returning 404 and the request is constantly re-queued. After multiple retries the interval reaches 5 minutes, so even once the submitter is ready, the RayJob CR's state cannot progress (it is still waiting for the retry). I think this hurts the efficiency of the job state machine, so I hope the priority of this work can be raised.
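
To make the backoff behavior concrete, here is a minimal controller-runtime sketch (illustrative only, not KubeRay's actual reconciler) of the pattern described above: while GetRayjobInfo keeps failing with 404 during the long image pull, every failed reconcile is re-queued by the default exponential-backoff rate limiter, so the next retry can be minutes away even after the submitter becomes ready.

package controllers

// Illustrative only; not KubeRay's actual code. With controller-runtime's
// default rate limiter, returning an error from Reconcile re-queues the
// request with an exponentially growing delay, which is why repeated 404s
// during a slow image pull can push the next retry several minutes out.

import (
	"context"
	"fmt"

	ctrl "sigs.k8s.io/controller-runtime"
)

type rayJobReconciler struct{}

func (r *rayJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// getRayJobInfo stands in for the dashboard call that returns 404 until
	// the submitter pod has actually submitted the job.
	if err := getRayJobInfo(ctx, req.Name); err != nil {
		// Returning an error hands the retry schedule to the workqueue's
		// exponential-backoff rate limiter instead of a fixed RequeueAfter.
		return ctrl.Result{}, fmt.Errorf("job info not ready: %w", err)
	}
	return ctrl.Result{}, nil
}

func getRayJobInfo(_ context.Context, name string) error {
	return fmt.Errorf("404: job %q not found", name) // placeholder
}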

Future-Outlier (Member) left a comment:

I will add more review comments tomorrow, thank you for the POC!

Comment on lines +34 to +35
defer func() { _ = resp.Body.Close() }()


Suggested change
defer func() { _ = resp.Body.Close() }()
defer func() { resp.Body.Close() }()

}
defer func() { _ = resp.Body.Close() }()

body, _ := io.ReadAll(resp.Body)

Suggested change
body, _ := io.ReadAll(resp.Body)
body, err := io.ReadAll(resp.Body)
if err != nil {
...
}
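
Combining the two suggestions above, the response-handling path might look roughly like this sketch (the package name, fetchBody helper, and error wrapping are illustrative, not the PR's final code):

package submitter

import (
	"fmt"
	"io"
	"net/http"
)

// fetchBody is an illustrative helper: it closes the response body via defer
// and surfaces the io.ReadAll error instead of discarding it.
func fetchBody(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer func() { resp.Body.Close() }()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read response body: %w", err)
	}
	return body, nil
}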
