Skip to content

Conversation

archlitchi
Copy link
Contributor

@archlitchi archlitchi commented Jan 3, 2025

/kind feature

/area scheduling

After last Friday's discussion, we decided to use sync.once to initialize device config from configmap once during initializing. This Feature is used combined with latest version of 'https://github.com/Project-HAMi/volcano-vgpu-device-plugin'

This PR exposes vGPU resourceName as #3926 does, Using device--configMap is a better than using scheduler-configmap, since device-config can be accessed by volcano-vgpu-device-plugin.

To be noticed:

  1. This feature is fully compatible with earlier versions, a default config will be generated if CM not found.
  2. The information got by configs is not processed now, it will be used for future dynamic-mig implementation
  3. This PR won't change the behavior of volcano-vgpu feature, dynamic-mig is not implemented in this PR
  4. This PR fix issue:We'd better not depend on a kubeconfig file #3473

@volcano-sh-bot volcano-sh-bot added kind/feature Categorizes issue or PR as related to a new feature. area/scheduling labels Jan 3, 2025
@volcano-sh-bot volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 3, 2025
@Monokaix Monokaix requested a review from Copilot March 21, 2025 03:01
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a unified device configuration for volcano vgpu by loading settings from a dedicated configMap, ensuring compatibility and preparing for future dynamic MIG support.

  • Implement unified device config loading using sync.Once in pkg/scheduler/api/devices/config.
  • Update the vgpu device initialization and utility functions to use the unified configuration.
  • Enhance test coverage for YAML parsing of device configuration.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/scheduler/api/devices/config/config.go Introduces device config loading from configMap and falls back to default config if necessary
pkg/scheduler/api/devices/config/config_test.go Adds YAML parsing tests for the Nvidia config
pkg/scheduler/api/devices/config/vgpu.go Defines vgpu-specific structures and configurations
pkg/scheduler/api/devices/util.go Updates client initialization for devices
pkg/scheduler/api/devices/nvidia/vgpu/device_info.go Adjusts vgpu device initialization to incorporate unified config loading
pkg/scheduler/api/devices/nvidia/vgpu/utils.go Refactors utility functions, including a new geo extraction helper
docs/user-guide/how_to_use_volcano_vgpu.md Updates documentation to include configuration customization

}
data, ok := cm.Data["device-config.yaml"]
if !ok {
return nil, errors.New("data-config.yaml not found")
Copy link
Preview

Copilot AI Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message refers to 'data-config.yaml' while the code checks for 'device-config.yaml'. Please update the error message to match the actual key being used.

Suggested change
return nil, errors.New("data-config.yaml not found")
return nil, errors.New("device-config.yaml not found")

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopted

Patch(context.Background(), node.Name, k8stypes.StrategicMergePatchType, bytes, metav1.PatchOptions{})
if err != nil {
klog.Errorf("patch pod %v failed, %v", node.Name, err)
}
return err
}

func extractGeoMetriyFromType(t string) ([]config.Geometry, error) {
Copy link
Preview

Copilot AI Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Typo in the function name 'extractGeoMetriyFromType'; consider renaming it to 'extractGeometryFromType' for clarity and consistency.

Suggested change
func extractGeoMetriyFromType(t string) ([]config.Geometry, error) {
func extractGeometryFromType(t string) ([]config.Geometry, error) {

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopted

@@ -131,6 +131,10 @@ curl {volcano device plugin pod ip}:9394/metrics
```
![img](https://github.com/Project-HAMi/volcano-vgpu-device-plugin/blob/main/doc/vgpu_device_plugin_metrics.png)

### Customize

You can customize volcano-vgpu feature by modifying configMap **volcano-vgpu-device-config**, that configMap is automatically deployed when you setup volcano-vgpu-device-plugin. You can change **resourceCountName**, **resourceMemoryName**, **gpuMemoryFactor**, **MigTemplates**, and others related to volcano-vgpu by editing that configMap, for more information, refer to volcano-vgpu-device-plugin.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refer to volcano-vgpu-device-plugin, maybe add a ref link here is better?

return nil, err
}
}
data, ok := cm.Data["device-config.yaml"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract device-config.yaml as a constant is better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopted

Type: items[3],
PodMap: make(map[string]*GPUUsage),
Health: health,
Mode: "hami-core",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also better to extract these modes to constant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, updated

}

func InitDevicesConfig(cmName string) {
once.Do(func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to confirm that we only initialize once here? And if the user modifies cm and want to reload the config, they can only restart the scheduler, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's for current version, we will add a informer to watch this configMap in future version.

@JesseStutler
Copy link
Member

Please also clean and squash your commit

@volcano-sh-bot volcano-sh-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 25, 2025
}

// NewClient connects to an API server
func NewClient() (*kubernetes.Clientset, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clinetcmd.BuildConfigFromFlags also called InClusterCondig, so it's duplicated here.
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omitted this comment? @archlitchi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, i have already fixed

return &yamlData, nil
}

func InitDevicesConfig(cmName string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not initialize it in the devicplugin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are initialized in device plugin as well, the config needs to be loaded both to vc-scheduler and volcano-vgpu-device-plugin


const DeviceConfigFileName = "device-config.yaml"

type Config struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this config only used in vgpu, if so, maybe put it in vgpu pkg is better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are designed to support multiple heterogeneous devices, not only vgpu

limitations under the License.
*/

package config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exported fields in this file should add a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopted

MigTemplate: []config.Geometry{},
MigUsage: config.MigInUse{},
}
if len(items) > 5 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 is a magic number, we'd better explain why and add a sample data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 is the length of items before dynamic-mig feature

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a comment and give a sample data.

if err != nil {
configs = &Config{
NvidiaConfig: NvidiaConfig{
ResourceCountName: "volcano.sh/vgpu-number",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are constant definition of vgpu-memory and vgpu-member, we'd better re-use them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopted

return configs
}

func LoadConfigFromCM(kubeClient kubernetes.Interface, cmName string) (*Config, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be exported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopted

@archlitchi archlitchi force-pushed the uniconfig branch 3 times, most recently from 6be54d3 to 61c02fa Compare April 8, 2025 08:15
Patch(context.Background(), node.Name, k8stypes.StrategicMergePatchType, bytes, metav1.PatchOptions{})
if err != nil {
klog.Errorf("patch pod %v failed, %v", node.Name, err)
}
return err
}

func extractGeometryFromType(t string) ([]config.Geometry, error) {
for _, val := range config.GetConfig().NvidiaConfig.MigGeometriesList {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should check config.GetConfig() not nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@archlitchi archlitchi force-pushed the uniconfig branch 2 times, most recently from 4b08d28 to 86ea6e2 Compare April 9, 2025 02:24
@Monokaix
Copy link
Member

/approve

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 10, 2025
return configs
}

func loadConfigFromCM(kubeClient kubernetes.Interface, cmName string) (*Config, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get ns by downward API.

}

func loadConfigFromCM(kubeClient kubernetes.Interface, cmName string) (*Config, error) {
cm, err := kubeClient.CoreV1().ConfigMaps("kube-system").Get(context.Background(), cmName, metav1.GetOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, why the device config cm is got from "kube-system" ns here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because this cm is installed by 'volcano-vgpu-device-plugin', which default ns is 'kube-system'

@JesseStutler
Copy link
Member

image
Hi, should only keep one commit, you need to rebase the latest master branch, there shouldn't be this Merge branch master into commit, thanks

@william-wang
Copy link
Member

/approve

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix, william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: limengxuan <391013634@qq.com>
@Monokaix
Copy link
Member

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2025
@volcano-sh-bot volcano-sh-bot merged commit 8e60079 into volcano-sh:master Apr 11, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/scheduling kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants