[RayJob][Feature] add light weight job submitter in kuberay image #2587
Conversation
Force-pushed from 6a0092f to 9483127
Just to note that although both PRs can solve the duplicate submission issue, this lightweight submitter can further shorten startup duration thanks to its smaller image.
Makes sense, but I'm concerned about the kuberay operator image becoming a dependency at the cluster/job level. If we think this is worth doing, we should probably create a new image.
Force-pushed from 2bdb064 to d274346
Force-pushed from 66302f3 to d274346
Force-pushed from d274346 to e3fb564
Hi @kevin85421, I have used a new GitHub Actions job to build a dedicated image for the submitter, but the job requires credentials which I believe are only available after the PR is merged. Do you have a suggested way to test the GitHub Actions job before merging the PR? Or should we just merge it first?
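(For context, a dedicated image-build job of the kind described above typically looks something like the sketch below; the action versions, registry, secret names, and image tag are all assumptions, not the workflow from this PR. The registry credentials live in repository secrets, which pre-merge runs from forks cannot read.)

```yaml
name: build-submitter-image
on:
  push:
    branches: [master]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: quay.io
          username: ${{ secrets.QUAY_USERNAME }} # assumed secret names; only
          password: ${{ secrets.QUAY_PASSWORD }} # populated after the PR merges
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: quay.io/kuberay/submitter:nightly # assumed image name
```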
IMO I don't think we need this with #2579 merged. Or at least we can revisit after v1.3 based on user feedback.
The lightweight job submitter still has its own benefits (e.g., much faster image pulling), but I agree that we can revisit this based on the feedback from v1.3 to determine if the image pulling overhead of the K8s Job Submitter is problematic. If users always run the submitter on a K8s node that caches the Ray image, the lightweight submitter may not be necessary.
@kevin85421 any news about this feature? It would be very useful in an AWS Fargate environment, because images are always pulled whenever a new Ray pod is created.
@kevin85421 We have encountered similar problems recently. Because the submitter's image pull takes so long, GetRayjobInfo keeps returning 404 and the request is constantly re-queued, which triggers controller-runtime's exponential backoff; after enough retries the interval reaches 5 minutes. Even once the submitter is ready, the RayJob CR state cannot advance while it waits for the next retry. I think this hurts the efficiency of the job state machine, so I hope the priority of this feature can be raised.
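(For reference, the backoff described here comes from client-go's per-item exponential rate limiter: controller-runtime's default starts at 5 ms, doubles on each failure, and caps at 1000 s, so roughly 17 consecutive failed reconciles push the re-queue delay past 5 minutes. A small runnable sketch, with a hypothetical queue key:)

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same parameters as the exponential half of controller-runtime's
	// default rate limiter: base delay 5ms, capped at 1000s.
	rl := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
	const item = "default/rayjob-sample" // hypothetical queue key
	for i := 1; i <= 20; i++ {
		d := rl.When(item) // delay doubles each failure: 5ms, 10ms, 20ms, ...
		if d >= 5*time.Minute {
			fmt.Printf("after %d consecutive failures the re-queue delay is %v\n", i, d)
			break
		}
	}
}
```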
I will add more review comments tomorrow, thank you for the POC!
defer func() { _ = resp.Body.Close() }()
Suggested change:
- defer func() { _ = resp.Body.Close() }()
+ defer func() { resp.Body.Close() }()
}
defer func() { _ = resp.Body.Close() }()
body, _ := io.ReadAll(resp.Body)
Suggested change:
- body, _ := io.ReadAll(resp.Body)
+ body, err := io.ReadAll(resp.Body)
+ if err != nil {
+     ...
+ }
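Taken together, the two suggestions amount to the pattern below. This is a minimal sketch assuming a plain net/http call to Ray's job REST API; the endpoint path and error handling are illustrative, not the exact code from this PR.

```go
package submitter

import (
	"fmt"
	"io"
	"net/http"
)

// getJobStatus shows the reviewed pattern end to end: close the response
// body via defer, and check io.ReadAll's error instead of discarding it.
func getJobStatus(dashboardURL, submissionID string) ([]byte, error) {
	resp, err := http.Get(dashboardURL + "/api/jobs/" + submissionID)
	if err != nil {
		return nil, err
	}
	defer func() { _ = resp.Body.Close() }()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read response body: %w", err)
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d: %s", resp.StatusCode, body)
	}
	return body, nil
}
```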
Why are these changes needed?

Currently, as noted in issue #2537, when a user creates a RayJob CR, KubeRay uses the same image as the RayCluster to start another container that submits the Ray job. However, if that container runs on a node without the image preloaded, it takes a long time to download the image and start, since the image is usually large.

This PR adds a light submitter (45 MB) that mimics the `ray job submit` behavior (submit + tail logs) into the KubeRay image, which is usually much smaller than the image used in the RayCluster. Users can try it with the `submitterPodTemplate` in their RayJob CR.

Example RayJob CR YAML:
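(The example YAML from the original description was not captured in this page; below is a minimal sketch, assuming the lightweight submitter ships in the standard KubeRay operator image. The image tags, container command, and Ray version are illustrative, not values confirmed by this PR.)

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # Override the default submitter (which reuses the Ray image) with the
  # much smaller KubeRay image. Image and container name are illustrative.
  submitterPodTemplate:
    spec:
      restartPolicy: Never
      containers:
        - name: ray-job-submitter
          image: quay.io/kuberay/operator:nightly
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
```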
Also, this submitter will not fail when the job has already been submitted, so it also solves #2154.
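(For illustration, idempotent submission could look like the sketch below; the endpoint and the duplicate-detection check are assumptions, not the exact logic in this PR.)

```go
package submitter

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// submitOnce sketches the idempotent-submission idea behind fixing #2154:
// if the job was already submitted by a previous submitter attempt, treat
// it as success instead of failing.
func submitOnce(dashboardURL string, payload []byte) error {
	resp, err := http.Post(dashboardURL+"/api/jobs/", "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer func() { _ = resp.Body.Close() }()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if resp.StatusCode == http.StatusOK {
		return nil
	}
	// Assumed: Ray rejects a duplicate submission_id with an error message
	// mentioning that the job already exists; treat that as success.
	if bytes.Contains(body, []byte("already exists")) {
		return nil
	}
	return fmt.Errorf("submit failed: %d %s", resp.StatusCode, body)
}
```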
Related issue number
#2537
Checks