-
Notifications
You must be signed in to change notification settings - Fork 608
Add batch-scheduler option, deprecate enable-batch-scheduler option #2300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@kevin85421 can you help to review this draft? I will start to add more UTs once the overall PR looks good to you. Thanks! |
There are a lot of CI failures. Would you mind fixing the errors? I will review after I release v1.2.0. |
Got to know the problem, fixing the issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. The KubeRay v1.2.0 release is taking longer than I expected. I'm back to reviewing PRs.
@@ -60,6 +60,11 @@ spec: | |||
{{- if .Values.batchScheduler.enabled -}} | |||
{{- $argList = append $argList "--enable-batch-scheduler" -}} | |||
{{- end -}} | |||
{{- if .Values.batchScheduler -}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can move {{- if .Values.batchScheduler -}}
above. That is,
{{- if .Values.batchScheduler -}}
{{- if .Values.batchScheduler.enabled -}}
{{- $argList = append $argList "--enable-batch-scheduler" -}}
{{- end -}}
{{- if .Values.batchScheduler.name -}}
{{- $argList = append $argList (printf "--batch-scheduler=%s" .Values.batchScheduler.name) -}}
{{- end -}}
{{- end -}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. Changes made. Thanks!
@@ -57,6 +57,7 @@ readinessProbe: | |||
|
|||
batchScheduler: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some comments for batchScheduler
to explain how to configure Volcano, Yunikorn, and the default scheduler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, added some comments. Will add some docs too in the coming PR.
// init the batch scheduler manager | ||
schedulerMgr, err := batchscheduler.NewSchedulerManager(rayConfigs, mgr.GetConfig()) | ||
if err != nil { | ||
// fail fast if the scheduler plugin is failed to init |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// fail fast if the scheduler plugin is failed to init | |
// fail fast if the scheduler plugin fails to init |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
schedulerMgr, err := batchscheduler.NewSchedulerManager(rayConfigs, mgr.GetConfig()) | ||
if err != nil { | ||
// fail fast if the scheduler plugin is failed to init | ||
// prevent running the controller in vague state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// prevent running the controller in vague state | |
// prevent running the controller in an undefined state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed.
ray-operator/main.go
Outdated
@@ -160,8 +160,19 @@ func main() { | |||
if forcedClusterUpgrade { | |||
setupLog.Info("Deprecated feature flag forced-cluster-upgrade is enabled, which has no effect.") | |||
} | |||
if ray.EnableBatchScheduler { | |||
setupLog.Info("Feature flag enable-batch-scheduler is enabled.") | |||
if config.EnableBatchScheduler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about creating a new function validateConfig
and move L163 - L176 to the function instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good idea. Changes made.
@@ -51,7 +52,6 @@ type reconcileFunc func(context.Context, *rayv1.RayCluster) error | |||
|
|||
var ( | |||
DefaultRequeueDuration = 2 * time.Second | |||
EnableBatchScheduler bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to see that we removed this global variable!
// no scheduler provided | ||
return &schedulerinterface.DefaultBatchScheduler{}, nil | ||
} | ||
batch.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why adding the lock here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checked the code, seems like there is no risk of running into concurrent issues. Since this will be called in the reconcile logic, which is streamlined. I think we can remove the lock.
cc @andrewsykim Would you mind reviewing this PR since it also updates the |
Sorry I missed the review on this! My only comment would be that the "default" batch scheduler option seems redundant. It seems like that should just be "not set" instead |
Hi @andrewsykim just want to confirm. you mean we do not need a "batchScheduler.name=default" as the default value. By default, we do not set this option. If set, the only valid options are "volcano" and "yunikorn" for now. Which means in values.yaml, we will have a by default commented out option, something like this?
|
Yes, I think that would be better. The "default" value is misleading because it implies there is default logic for batch scheduling, but the default implementation is a no-op right? |
Sounds good. I will submit a follow up PR to address this. Thanks @andrewsykim |
@andrewsykim pls see #2371 |
Summary
See proposal: https://github.com/ray-project/enhancements/blob/main/reps/2024-06-16-support-apache-yunikorn-scheduler.md.
For better batch scheduler integration, this PR adds another option
batch-scheduler
and user needs to explicitly provide a scheduler name, currently supported: volcano and yunikorn. The old flagenable-batch-scheduler
will be deprecated but still be supported. Main changes:EnableBatchScheduler
inraycluster_controller.go
SchedulerManager
, based on the configs, initiaize just oneschedulerinterface.BatchScheduler
andschedulerinterface.BatchSchedulerFactory
, no need to keep a list, there will be just one scheduler support at runtime.raycluster_controller.go
, simply call scheduler manager to do scheduler related operations without checking the flag.Test
The old flag is marked as deprecated, prompts the message to ask using the new flag instead. But the flag is still supported.
if PodGroup CRD is not installed, it will fail fast with the following error message:
logs (some logs omitted). The log indicates "volcano" is used as the scheduler, and it loads the
v1beta1.PodGroup
resource, requires that CRD to be pre-installed.logs (some logs omitted), note, no podGroup (from Volcano) related resource loaded in the controller:
failed with the following error message: