
Conversation

@yangwwei (Contributor) commented Aug 10, 2024

Summary

See proposal: https://github.com/ray-project/enhancements/blob/main/reps/2024-06-16-support-apache-yunikorn-scheduler.md.

For better batch scheduler integration, this PR adds another option, batch-scheduler, where the user must explicitly provide a scheduler name; currently supported values are volcano and yunikorn. The old flag enable-batch-scheduler is deprecated but still supported. Main changes:

  • Removed the global variable EnableBatchScheduler in raycluster_controller.go.
  • Pass the configs to SchedulerManager and, based on them, initialize just one schedulerinterface.BatchScheduler and schedulerinterface.BatchSchedulerFactory. There is no need to keep a list, since only one scheduler is supported at runtime (see the sketch after this list).
  • Simplified the logic in raycluster_controller.go: it now simply calls the scheduler manager for scheduler-related operations without checking the flag.
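
To illustrate the single-scheduler selection described above, here is a minimal, runnable Go sketch. The flag values volcano and yunikorn come from this PR; every other name in the snippet is a hypothetical stand-in, not the actual KubeRay code.

package main

import "fmt"

// batchScheduler is a stand-in for the real schedulerinterface.BatchScheduler.
type batchScheduler struct{ name string }

// newScheduler selects exactly one scheduler implementation from the
// configured name and fails fast on unknown values.
func newScheduler(name string) (*batchScheduler, error) {
    switch name {
    case "volcano", "yunikorn":
        return &batchScheduler{name: name}, nil
    case "":
        // Nothing configured: fall back to a no-op default scheduler.
        return &batchScheduler{name: "default"}, nil
    default:
        return nil, fmt.Errorf("unsupported batch scheduler: %q", name)
    }
}

func main() {
    s, err := newScheduler("yunikorn")
    if err != nil {
        panic(err)
    }
    fmt.Println("using scheduler:", s.name)
}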

Test

  1. Build the operator:
make -C ray-operator build
  2. Test the old flag:

The old flag is marked as deprecated and logs a message asking the user to switch to the new flag, but it is still supported.

 ./ray-operator/bin/manager -enable-batch-scheduler -leader-election-namespace default -use-kubernetes-proxy 

{"level":"info","ts":"2024-08-11T11:54:04.935-0700","logger":"setup","msg":"Feature flag enable-batch-scheduler is deprecated and will not be supported soon. Use batch-scheduler instead. "}
...

If the PodGroup CRD is not installed, the operator fails fast with the following error message:

panic: podGroup CRD is required to exist in current cluster. error: the server could not find the requested resource
goroutine 1 [running]:
github.com/ray-project/kuberay/ray-operator/controllers/ray.NewReconciler({0x105c2d858, _}, {_, _}, {{_, _, _}, {_, _, _}}, ...)
        /Users/weiweiyang/workspace/github/forks/kuberay/ray-operator/controllers/ray/raycluster_controller.go:114 +0x278
  3. Test the new config:
  • batch-scheduler=volcano
./ray-operator/bin/manager -batch-scheduler volcano -leader-election-namespace default -use-kubernetes-proxy 

Logs (some omitted). They indicate that "volcano" is used as the scheduler and that it loads the v1beta1.PodGroup resource, which requires that CRD to be pre-installed.


{"level":"info","ts":"2024-08-11T11:56:31.688-0700","logger":"setup","msg":"Feature flag batch-scheduler is enabled","scheduler name":"volcano"}
...
{"level":"info","ts":"2024-08-11T11:58:34.756-0700","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.RayCluster"}
{"level":"info","ts":"2024-08-11T11:58:34.756-0700","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.Pod"}
{"level":"info","ts":"2024-08-11T11:58:34.756-0700","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.Service"}
{"level":"info","ts":"2024-08-11T11:58:34.756-0700","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1beta1.PodGroup"}
...
{"level":"info","ts":"2024-08-11T11:58:34.757-0700","logger":"controllers.RayJob","msg":"Starting Controller"}
{"level":"info","ts":"2024-08-11T11:58:34.873-0700","logger":"controllers.RayCluster","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-08-11T11:58:34.874-0700","logger":"controllers.RayService","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-08-11T11:58:34.874-0700","logger":"controllers.RayJob","msg":"Starting workers","worker count":1}
...
  4. Test the new config:
  • batch-scheduler=yunikorn
./ray-operator/bin/manager -batch-scheduler yunikorn -leader-election-namespace default -use-kubernetes-proxy 

Logs (some omitted). Note that no PodGroup (Volcano) related resources are loaded in the controller:

{"level":"info","ts":"2024-08-11T12:00:22.148-0700","logger":"setup","msg":"Feature flag batch-scheduler is enabled","scheduler name":"yunikorn"}
...
{"level":"info","ts":"2024-08-11T12:00:39.359-0700","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.RayCluster"}
{"level":"info","ts":"2024-08-11T12:00:39.359-0700","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.Pod"}
{"level":"info","ts":"2024-08-11T12:00:39.359-0700","logger":"controllers.RayCluster","msg":"Starting EventSource","source":"kind source: *v1.Service"}
...
{"level":"info","ts":"2024-08-11T12:00:39.360-0700","logger":"controllers.RayService","msg":"Starting Controller"}
{"level":"info","ts":"2024-08-11T12:00:39.463-0700","logger":"controllers.RayCluster","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-08-11T12:00:39.464-0700","logger":"controllers.RayJob","msg":"Starting workers","worker count":1}
{"level":"info","ts":"2024-08-11T12:00:39.470-0700","logger":"controllers.RayService","msg":"Starting workers","worker count":1}
  5. Use both the old and new options together (not supported):
./ray-operator/bin/manager -enable-batch-scheduler -batch-scheduler yunikorn -leader-election-namespace default -use-kubernetes-proxy 

The command fails with the following error message:

{"level":"error","ts":"2024-08-11T12:02:04.914-0700","logger":"setup","msg":"do not use both options together: \"batch-scheduler\" and \"enable-batch-scheduler\".","error":"invalid configuration found","stacktrace":"main.exitOnError\n\t/Users/weiweiyang/workspace/github/forks/kuberay/ray-operator/main.go:271\nmain.main\n\t/Users/weiweiyang/workspace/github/forks/kuberay/ray-operator/main.go:165\nruntime.main\n\t/Users/weiweiyang/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.3.darwin-arm64/src/runtime/proc.go:271"}

@yangwwei marked this pull request as draft on August 10, 2024 06:12
@yangwwei (Contributor, Author)

@kevin85421 can you help review this draft? I will start adding more UTs once the overall PR looks good to you. Thanks!

@kevin85421 (Member)

There are a lot of CI failures. Would you mind fixing the errors? I will review after I release v1.2.0.

@kevin85421 self-assigned this on Aug 12, 2024
@yangwwei (Contributor, Author) commented Aug 12, 2024

I found the problem and am fixing the issues.

@yangwwei marked this pull request as ready for review on August 21, 2024 20:13
@kevin85421 (Member) left a comment


Sorry for the delay. The KubeRay v1.2.0 release is taking longer than I expected. I'm back to reviewing PRs.

@@ -60,6 +60,11 @@ spec:
{{- if .Values.batchScheduler.enabled -}}
{{- $argList = append $argList "--enable-batch-scheduler" -}}
{{- end -}}
{{- if .Values.batchScheduler -}}
@kevin85421 (Member):

Maybe we can move {{- if .Values.batchScheduler -}} above. That is,

            {{- if .Values.batchScheduler -}}
            {{- if .Values.batchScheduler.enabled -}}
            {{- $argList = append $argList "--enable-batch-scheduler" -}}
            {{- end -}}
            {{- if .Values.batchScheduler.name -}}
            {{- $argList = append $argList (printf "--batch-scheduler=%s" .Values.batchScheduler.name) -}}
            {{- end -}}
            {{- end -}}

@yangwwei (Contributor, Author):

Good suggestion. Changes made. Thanks!

@@ -57,6 +57,7 @@ readinessProbe:

batchScheduler:
@kevin85421 (Member):

Can you add some comments for batchScheduler to explain how to configure Volcano, Yunikorn, and the default scheduler?

@yangwwei (Contributor, Author):

Sure, added some comments. Will add some docs too in the coming PR.

// init the batch scheduler manager
schedulerMgr, err := batchscheduler.NewSchedulerManager(rayConfigs, mgr.GetConfig())
if err != nil {
// fail fast if the scheduler plugin is failed to init
@kevin85421 (Member):

Suggested change
// fail fast if the scheduler plugin is failed to init
// fail fast if the scheduler plugin fails to init

@yangwwei (Contributor, Author):

Addressed

schedulerMgr, err := batchscheduler.NewSchedulerManager(rayConfigs, mgr.GetConfig())
if err != nil {
// fail fast if the scheduler plugin is failed to init
// prevent running the controller in vague state
@kevin85421 (Member):

Suggested change
// prevent running the controller in vague state
// prevent running the controller in an undefined state

@yangwwei (Contributor, Author):

Addressed.

@@ -160,8 +160,19 @@ func main() {
if forcedClusterUpgrade {
setupLog.Info("Deprecated feature flag forced-cluster-upgrade is enabled, which has no effect.")
}
if ray.EnableBatchScheduler {
setupLog.Info("Feature flag enable-batch-scheduler is enabled.")
if config.EnableBatchScheduler {
@kevin85421 (Member):

How about creating a new function validateConfig and move L163 - L176 to the function instead?

@yangwwei (Contributor, Author):

Yeah, good idea. Changes made.
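
For illustration, the suggested validateConfig could take roughly this shape. The flag semantics and the mutual-exclusion error message come from this PR; the struct and everything else are hypothetical.

package main

import (
    "errors"
    "fmt"
)

// config mirrors the two flags discussed in this PR (illustrative only).
type config struct {
    EnableBatchScheduler bool   // deprecated flag
    BatchScheduler       string // "volcano" or "yunikorn"; empty when unset
}

// validateConfig rejects conflicting or unknown scheduler settings up front.
func validateConfig(c config) error {
    if c.EnableBatchScheduler && c.BatchScheduler != "" {
        return errors.New(`do not use both options together: "batch-scheduler" and "enable-batch-scheduler"`)
    }
    switch c.BatchScheduler {
    case "", "volcano", "yunikorn":
        return nil
    default:
        return fmt.Errorf("unsupported batch scheduler name: %q", c.BatchScheduler)
    }
}

func main() {
    err := validateConfig(config{EnableBatchScheduler: true, BatchScheduler: "yunikorn"})
    fmt.Println(err) // conflicting options -> error
}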

@@ -51,7 +52,6 @@ type reconcileFunc func(context.Context, *rayv1.RayCluster) error

var (
DefaultRequeueDuration = 2 * time.Second
EnableBatchScheduler bool
@kevin85421 (Member):

Happy to see that we removed this global variable!

// no scheduler provided
return &schedulerinterface.DefaultBatchScheduler{}, nil
}
batch.Lock()
@kevin85421 (Member):

Why add the lock here?

@yangwwei (Contributor, Author):

I double-checked the code, and there seems to be no risk of concurrency issues, since this is only called from the reconcile logic, which runs serially. I think we can remove the lock.
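
As a sketch of the resulting lock-free accessor (all names hypothetical; it assumes, per the comment above, that every call comes from the serialized reconcile path):

package main

import "fmt"

// scheduler is a stand-in for schedulerinterface.BatchScheduler.
type scheduler struct{ name string }

type schedulerManager struct {
    // cached holds the lazily created scheduler. No mutex is taken: this
    // relies on the single-caller assumption above; with concurrent
    // callers a lock would still be required.
    cached *scheduler
}

func (m *schedulerManager) getScheduler() *scheduler {
    if m.cached == nil {
        m.cached = &scheduler{name: "yunikorn"} // illustrative initialization
    }
    return m.cached
}

func main() {
    m := &schedulerManager{}
    fmt.Println(m.getScheduler().name)
}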

@kevin85421 (Member)

cc @andrewsykim Would you mind reviewing this PR since it also updates the config API?

@kevin85421 merged commit e1edb4c into ray-project:master on Sep 8, 2024
27 checks passed
@yangwwei deleted the batchscheduler branch on September 8, 2024 07:13
@andrewsykim (Member)

Sorry I missed the review on this! My only comment would be that the "default" batch scheduler option seems redundant. It seems like that should just be "not set" instead.

@yangwwei (Contributor, Author) commented Sep 9, 2024

> Sorry I missed the review on this! My only comment would be that the "default" batch scheduler option seems redundant. It seems like that should just be "not set" instead.

Hi @andrewsykim, just want to confirm: you mean we don't need "batchScheduler.name=default" as the default value? By default, we don't set this option; if set, the only valid options for now are "volcano" and "yunikorn". That means in values.yaml we will have an option that is commented out by default, something like this?

batchScheduler:

  # Deprecated. This option will be removed in the future.
  # Note: kept for backwards compatibility. When set to true, it enables the Volcano scheduler integration.
  enabled: false

  # Name of the scheduler; currently "volcano" and "yunikorn" are supported. Do not set
  # "batchScheduler.enabled=true" at the same time, as it will override this option.
  # name: volcano | yunikorn
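
With values like these, selecting a scheduler at install time would look roughly like the following (illustrative helm invocation; the chart reference is an assumption):

helm install kuberay-operator kuberay/kuberay-operator --set batchScheduler.name=yunikorn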

@andrewsykim (Member)

Yes, I think that would be better. The "default" value is misleading because it implies there is default logic for batch scheduling, but the default implementation is a no-op, right?

@yangwwei (Contributor, Author) commented Sep 9, 2024

> Yes, I think that would be better. The "default" value is misleading because it implies there is default logic for batch scheduling, but the default implementation is a no-op, right?

Sounds good. I will submit a follow-up PR to address this. Thanks @andrewsykim!
BTW, when do you think is the best time to remove the old option "batchScheduler.enabled"? We can keep it for a few releases, but we need to be clear about when it will be removed. Thanks!

@yangwwei (Contributor, Author)

@andrewsykim please see #2371
