volcano-devices unified config #3953

archlitchi · 2025-01-03T02:49:42Z

/kind feature

/area scheduling

After last Friday's discussion, we decided to use sync.Once to initialize the device config from a ConfigMap exactly once during initialization. This feature is meant to be used together with the latest version of 'https://github.com/Project-HAMi/volcano-vgpu-device-plugin'.

This PR exposes the vGPU resourceName as #3926 does. Using a dedicated device ConfigMap is better than using the scheduler ConfigMap, since the device config can also be accessed by volcano-vgpu-device-plugin.

Please note:

This feature is fully compatible with earlier versions; a default config will be generated if the ConfigMap is not found.
The information obtained from the config is not processed yet; it will be used for the future dynamic-mig implementation.
This PR won't change the behavior of the volcano-vgpu feature; dynamic-mig is not implemented in this PR.
This PR fixes the issue: We'd better not depend on a kubeconfig file #3473
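
To make the once-only initialization and the default fallback concrete, here is a minimal sketch. It is not the PR's code: `initConfigOnce`, `getConfig`, `failingLoader`, and the simplified `Config` struct are hypothetical names, and the loader function stands in for the real ConfigMap lookup. The resource name `volcano.sh/vgpu-number` is taken from the diff; the memory resource name is assumed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Config is a simplified, hypothetical stand-in for the device config.
type Config struct {
	ResourceCountName  string
	ResourceMemoryName string
}

var (
	once    sync.Once
	configs *Config
)

// defaultConfig returns the values used when no ConfigMap can be loaded.
// "volcano.sh/vgpu-number" appears in the diff; the memory name is assumed.
func defaultConfig() *Config {
	return &Config{
		ResourceCountName:  "volcano.sh/vgpu-number",
		ResourceMemoryName: "volcano.sh/vgpu-memory",
	}
}

// initConfigOnce runs load at most once per process.
// If load fails, the default config is stored instead.
func initConfigOnce(load func() (*Config, error)) {
	once.Do(func() {
		cfg, err := load()
		if err != nil {
			cfg = defaultConfig()
		}
		configs = cfg
	})
}

// getConfig returns whatever the first initConfigOnce call stored.
func getConfig() *Config { return configs }

// failingLoader simulates a missing ConfigMap.
func failingLoader() (*Config, error) {
	return nil, errors.New("configmap not found")
}

func main() {
	initConfigOnce(failingLoader)
	fmt.Println(getConfig().ResourceCountName)
}
```

Because of `sync.Once`, any later call to `initConfigOnce` is a no-op, which is why a changed ConfigMap only takes effect after a restart in this design.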

Copilot

Pull Request Overview

This PR implements a unified device configuration for volcano vgpu by loading settings from a dedicated configMap, ensuring compatibility and preparing for future dynamic MIG support.

Implement unified device config loading using sync.Once in pkg/scheduler/api/devices/config.
Update the vgpu device initialization and utility functions to use the unified configuration.
Enhance test coverage for YAML parsing of device configuration.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/scheduler/api/devices/config/config.go | Introduces device config loading from configMap and falls back to default config if necessary |
| pkg/scheduler/api/devices/config/config_test.go | Adds YAML parsing tests for the Nvidia config |
| pkg/scheduler/api/devices/config/vgpu.go | Defines vgpu-specific structures and configurations |
| pkg/scheduler/api/devices/util.go | Updates client initialization for devices |
| pkg/scheduler/api/devices/nvidia/vgpu/device_info.go | Adjusts vgpu device initialization to incorporate unified config loading |
| pkg/scheduler/api/devices/nvidia/vgpu/utils.go | Refactors utility functions, including a new geo extraction helper |
| docs/user-guide/how_to_use_volcano_vgpu.md | Updates documentation to include configuration customization |

Copilot · 2025-03-21T03:02:41Z

pkg/scheduler/api/devices/config/config.go

+	}
+	data, ok := cm.Data["device-config.yaml"]
+	if !ok {
+		return nil, errors.New("data-config.yaml not found")


The error message refers to 'data-config.yaml' while the code checks for 'device-config.yaml'. Please update the error message to match the actual key being used.

Suggested change

return nil, errors.New("data-config.yaml not found")

return nil, errors.New("device-config.yaml not found")

Copilot · 2025-03-21T03:02:42Z

pkg/scheduler/api/devices/nvidia/vgpu/utils.go

 		Patch(context.Background(), node.Name, k8stypes.StrategicMergePatchType, bytes, metav1.PatchOptions{})
 	if err != nil {
 		klog.Errorf("patch pod %v failed, %v", node.Name, err)
 	}
 	return err
 }

+func extractGeoMetriyFromType(t string) ([]config.Geometry, error) {


[nitpick] Typo in the function name 'extractGeoMetriyFromType'; consider renaming it to 'extractGeometryFromType' for clarity and consistency.

Suggested change

func extractGeoMetriyFromType(t string) ([]config.Geometry, error) {

func extractGeometryFromType(t string) ([]config.Geometry, error) {

JesseStutler · 2025-03-21T02:36:06Z

docs/user-guide/how_to_use_volcano_vgpu.md

@@ -131,6 +131,10 @@ curl {volcano device plugin pod ip}:9394/metrics
 ```
 ![img](https://github.com/Project-HAMi/volcano-vgpu-device-plugin/blob/main/doc/vgpu_device_plugin_metrics.png)

+### Customize
+
+You can customize volcano-vgpu feature by modifying configMap **volcano-vgpu-device-config**, that configMap is automatically deployed when you setup volcano-vgpu-device-plugin. You can change **resourceCountName**, **resourceMemoryName**, **gpuMemoryFactor**, **MigTemplates**, and others related to volcano-vgpu by editing that configMap, for more information, refer to volcano-vgpu-device-plugin.


Regarding "refer to volcano-vgpu-device-plugin": maybe it would be better to add a reference link here?

JesseStutler · 2025-03-24T01:33:52Z

pkg/scheduler/api/devices/config/config.go

+			return nil, err
+		}
+	}
+	data, ok := cm.Data["device-config.yaml"]


It would be better to extract device-config.yaml as a constant.
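
A rough sketch of what this could look like, assuming a plain `map[string]string` in place of the ConfigMap's `Data` field. The constant name matches the one that shows up in a later revision of the diff in this thread; `configData` is a hypothetical helper. Building the error message from the constant also avoids the key/message mismatch flagged earlier.

```go
package main

import "fmt"

// DeviceConfigFileName is the ConfigMap data key that holds the device config.
const DeviceConfigFileName = "device-config.yaml"

// configData returns the raw YAML stored under DeviceConfigFileName.
// data stands in for the Data field of a ConfigMap (hypothetical helper).
func configData(data map[string]string) (string, error) {
	raw, ok := data[DeviceConfigFileName]
	if !ok {
		// The message is derived from the constant, so it cannot drift from the key.
		return "", fmt.Errorf("%s not found", DeviceConfigFileName)
	}
	return raw, nil
}

func main() {
	_, err := configData(map[string]string{})
	fmt.Println(err)
}
```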

JesseStutler · 2025-03-24T01:42:46Z

pkg/scheduler/api/devices/nvidia/vgpu/utils.go

+				Type:        items[3],
+				PodMap:      make(map[string]*GPUUsage),
+				Health:      health,
+				Mode:        "hami-core",


It would also be better to extract these modes into constants.

okay, updated
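
For illustration, one possible shape for such constants. Only the value `"hami-core"` comes from the diff; the constant names, the `"mig"` value, and `isKnownMode` are assumptions made for this sketch.

```go
package main

import "fmt"

const (
	// ModeHAMiCore is the value that was hard-coded in the reviewed diff.
	ModeHAMiCore = "hami-core"
	// ModeMIG is an assumed value for a MIG-based mode; it is not taken from the PR.
	ModeMIG = "mig"
)

// isKnownMode reports whether mode matches one of the constants above.
func isKnownMode(mode string) bool {
	switch mode {
	case ModeHAMiCore, ModeMIG:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(isKnownMode(ModeHAMiCore), isKnownMode("hami-cor"))
}
```

With named constants, a misspelled mode becomes a compile error at the call site instead of a silent string mismatch at runtime.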

JesseStutler · 2025-03-24T01:44:47Z

pkg/scheduler/api/devices/config/config.go

+}
+
+func InitDevicesConfig(cmName string) {
+	once.Do(func() {


I want to confirm that we only initialize once here. If the user modifies the ConfigMap and wants to reload the config, they can only restart the scheduler, right?

Yes, that's the case for the current version. We will add an informer to watch this ConfigMap in a future version.
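
As a rough idea of what a reloadable config could look like once a watcher exists, the sketch below keeps the config behind a lock so it can be swapped at runtime. This is purely hypothetical and not part of the PR: `configStore` and its methods are invented for illustration, and the informer wiring itself is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// Config is a simplified, hypothetical stand-in for the device config.
type Config struct{ ResourceCountName string }

// configStore guards a config that can be replaced while the process runs.
type configStore struct {
	mu  sync.RWMutex
	cfg *Config
}

// Set swaps in a new config; a ConfigMap update handler could call this.
func (s *configStore) Set(c *Config) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cfg = c
}

// Get returns the most recently stored config, or nil if none was set.
func (s *configStore) Get() *Config {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cfg
}

func main() {
	var store configStore
	store.Set(&Config{ResourceCountName: "volcano.sh/vgpu-number"})
	store.Set(&Config{ResourceCountName: "example.com/vgpu"}) // simulated update
	fmt.Println(store.Get().ResourceCountName)
}
```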

JesseStutler · 2025-03-24T01:45:20Z

Please also clean up and squash your commits.

Monokaix · 2025-04-07T07:13:43Z

pkg/scheduler/api/devices/util.go

+}
+
+// NewClient connects to an API server
+func NewClient() (*kubernetes.Clientset, error) {


clientcmd.BuildConfigFromFlags also calls InClusterConfig, so it's duplicated here.

Omitted this comment? @archlitchi

No, I have already fixed it.

Monokaix · 2025-04-07T07:17:21Z

pkg/scheduler/api/devices/config/config.go

+	return &yamlData, nil
+}
+
+func InitDevicesConfig(cmName string) {


Why not initialize it in the device plugin?

It is initialized in the device plugin as well; the config needs to be loaded by both vc-scheduler and volcano-vgpu-device-plugin.

Monokaix · 2025-04-07T07:22:03Z

pkg/scheduler/api/devices/config/config.go

+
+const DeviceConfigFileName = "device-config.yaml"
+
+type Config struct {


Is this config only used by vgpu? If so, it may be better to put it in the vgpu package.

It is designed to support multiple heterogeneous devices, not only vgpu.

Monokaix · 2025-04-07T07:22:59Z

pkg/scheduler/api/devices/config/vgpu.go

+limitations under the License.
+*/
+
+package config


The exported fields in this file should have comments.

Monokaix · 2025-04-07T07:24:20Z

pkg/scheduler/api/devices/nvidia/vgpu/utils.go

+				MigTemplate: []config.Geometry{},
+				MigUsage:    config.MigInUse{},
+			}
+			if len(items) > 5 {


5 is a magic number; we'd better explain why and add sample data.

5 is the length of items before the dynamic-mig feature.

We should add a comment and give sample data.
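
A sketch of how the magic number could be named and documented. The sample entry and the field order are assumptions for illustration, except that the type sits at index 3 as in the diff; reading a sixth field as the mode is also assumed. `parseDevice` and `device` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// legacyFieldCount is the number of comma-separated fields in a device entry
// before the dynamic-mig feature, e.g. "GPU-0000,10,16384,NVIDIA-A100,true"
// (sample data invented for this sketch).
const legacyFieldCount = 5

type device struct {
	UUID   string
	Type   string
	Health bool
	Mode   string
}

// parseDevice decodes one entry. The field layout is an assumption except for
// Type at index 3; a sixth field, if present, is treated as the mode.
func parseDevice(entry string) (device, error) {
	items := strings.Split(entry, ",")
	if len(items) < legacyFieldCount {
		return device{}, fmt.Errorf("expected at least %d fields, got %d", legacyFieldCount, len(items))
	}
	health, err := strconv.ParseBool(items[4])
	if err != nil {
		return device{}, fmt.Errorf("invalid health %q: %w", items[4], err)
	}
	d := device{UUID: items[0], Type: items[3], Health: health, Mode: "hami-core"}
	if len(items) > legacyFieldCount {
		d.Mode = items[5] // entries written after dynamic-mig carry extra fields
	}
	return d, nil
}

func main() {
	d, err := parseDevice("GPU-0000,10,16384,NVIDIA-A100,true")
	fmt.Println(d.Type, d.Mode, err)
}
```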

Monokaix · 2025-04-07T07:27:46Z

pkg/scheduler/api/devices/config/config.go

+		if err != nil {
+			configs = &Config{
+				NvidiaConfig: NvidiaConfig{
+					ResourceCountName:   "volcano.sh/vgpu-number",


There are already constant definitions for vgpu-memory and vgpu-number; we'd better reuse them.

Monokaix · 2025-04-07T07:28:18Z

pkg/scheduler/api/devices/config/config.go

+	return configs
+}
+
+func LoadConfigFromCM(kubeClient kubernetes.Interface, cmName string) (*Config, error) {


Should this be exported?

Monokaix · 2025-04-08T11:42:18Z

pkg/scheduler/api/devices/nvidia/vgpu/utils.go

 		Patch(context.Background(), node.Name, k8stypes.StrategicMergePatchType, bytes, metav1.PatchOptions{})
 	if err != nil {
 		klog.Errorf("patch pod %v failed, %v", node.Name, err)
 	}
 	return err
 }

+func extractGeometryFromType(t string) ([]config.Geometry, error) {
+	for _, val := range config.GetConfig().NvidiaConfig.MigGeometriesList {


Should check that config.GetConfig() is not nil.
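
Something along these lines, as a sketch only. The config is passed in as a parameter here to keep the example self-contained, and the `AllowedMigGeometries` layout (a list of model names plus their geometries) and the substring match are assumptions, not the PR's actual types or matching rule.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Geometry and AllowedMigGeometries are hypothetical, simplified types.
type Geometry struct {
	Name  string
	Count int
}

type AllowedMigGeometries struct {
	Models     []string
	Geometries []Geometry
}

type Config struct {
	MigGeometriesList []AllowedMigGeometries
}

// extractGeometry returns the geometries for device type t.
// It fails early when the config has not been initialized.
func extractGeometry(cfg *Config, t string) ([]Geometry, error) {
	if cfg == nil {
		return nil, errors.New("device config is not initialized")
	}
	for _, val := range cfg.MigGeometriesList {
		for _, model := range val.Models {
			if strings.Contains(t, model) {
				return val.Geometries, nil
			}
		}
	}
	return nil, fmt.Errorf("no MIG geometry found for type %q", t)
}

func main() {
	_, err := extractGeometry(nil, "NVIDIA-A100")
	fmt.Println(err)
}
```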

Monokaix · 2025-04-10T07:01:32Z

/approve

Monokaix · 2025-04-10T07:04:27Z

pkg/scheduler/api/devices/config/config.go

+	return configs
+}
+
+func loadConfigFromCM(kubeClient kubernetes.Interface, cmName string) (*Config, error) {


We can get ns by downward API.

JesseStutler · 2025-04-10T12:25:53Z

pkg/scheduler/api/devices/config/config.go

+}
+
+func loadConfigFromCM(kubeClient kubernetes.Interface, cmName string) (*Config, error) {
+	cm, err := kubeClient.CoreV1().ConfigMaps("kube-system").Get(context.Background(), cmName, metav1.GetOptions{})


Wait, why is the device config ConfigMap fetched from the "kube-system" namespace here?

Because this ConfigMap is installed by 'volcano-vgpu-device-plugin', whose default namespace is 'kube-system'.
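
One way to reconcile the hard-coded namespace with the earlier downward API suggestion is an optional override with "kube-system" as the fallback, sketched below. The environment variable name `DEVICE_CONFIG_NAMESPACE` is hypothetical; the pod spec would have to set it, for example from `metadata.namespace` via the downward API or as a literal value when the plugin is installed in a different namespace than the scheduler.

```go
package main

import (
	"fmt"
	"os"
)

// defaultConfigNamespace is the namespace hard-coded in the reviewed diff.
const defaultConfigNamespace = "kube-system"

// configNamespaceEnv is a hypothetical variable name, not defined by the PR.
const configNamespaceEnv = "DEVICE_CONFIG_NAMESPACE"

// configNamespace returns the namespace to read the device ConfigMap from.
// lookup is injected (normally os.Getenv) so the function is easy to test.
func configNamespace(lookup func(string) string) string {
	if ns := lookup(configNamespaceEnv); ns != "" {
		return ns
	}
	return defaultConfigNamespace
}

func main() {
	fmt.Println(configNamespace(os.Getenv))
}
```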

JesseStutler · 2025-04-10T12:28:27Z

Hi, please keep only one commit. You need to rebase onto the latest master branch; there shouldn't be a "Merge branch master into" commit. Thanks.

william-wang · 2025-04-11T01:03:46Z

/approve

volcano-sh-bot · 2025-04-11T01:03:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix, william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [Monokaix,william-wang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: limengxuan <391013634@qq.com>

Monokaix · 2025-04-11T03:33:56Z

/lgtm

volcano-sh-bot added kind/feature Categorizes issue or PR as related to a new feature. area/scheduling labels Jan 3, 2025

volcano-sh-bot requested review from merryzhou, shinytang6 and william-wang January 3, 2025 02:49

volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 3, 2025

This was referenced Jan 9, 2025

feat: expose vGPU resource names as arguments #3926

Open

We'd better not depend on a kubeconfig file #3473

Closed

Monokaix requested a review from Copilot March 21, 2025 03:01

Copilot AI reviewed Mar 21, 2025

JesseStutler reviewed Mar 24, 2025

archlitchi force-pushed the uniconfig branch from 7ea559c to 337aad7 Compare March 25, 2025 07:54

volcano-sh-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 25, 2025

archlitchi force-pushed the uniconfig branch from cc88840 to 398501e Compare March 25, 2025 07:58

Monokaix reviewed Apr 7, 2025

archlitchi force-pushed the uniconfig branch 3 times, most recently from 6be54d3 to 61c02fa Compare April 8, 2025 08:15

Monokaix reviewed Apr 8, 2025

archlitchi force-pushed the uniconfig branch 2 times, most recently from 4b08d28 to 86ea6e2 Compare April 9, 2025 02:24

volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 10, 2025

Monokaix reviewed Apr 10, 2025

archlitchi force-pushed the uniconfig branch from 65c5662 to e4b494f Compare April 10, 2025 07:41

JesseStutler reviewed Apr 10, 2025

update configs for dynamic-volcano

2ae6d46

Signed-off-by: limengxuan <391013634@qq.com>

archlitchi force-pushed the uniconfig branch from d7a92ea to 2ae6d46 Compare April 11, 2025 02:40

volcano-sh-bot assigned Monokaix Apr 11, 2025

volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2025

volcano-sh-bot merged commit 8e60079 into volcano-sh:master Apr 11, 2025
16 checks passed

lowang-bh mentioned this pull request Jul 13, 2025

fix(scheduler): remove kubeconfig fallback in nodelock NewClient for … #4453

Closed
