-
Notifications
You must be signed in to change notification settings - Fork 1.2k
volcano-devices unified config #3953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a unified device configuration for volcano vgpu by loading settings from a dedicated configMap, ensuring compatibility and preparing for future dynamic MIG support.
- Implement unified device config loading using sync.Once in pkg/scheduler/api/devices/config.
- Update the vgpu device initialization and utility functions to use the unified configuration.
- Enhance test coverage for YAML parsing of device configuration.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
pkg/scheduler/api/devices/config/config.go | Introduces device config loading from configMap and falls back to default config if necessary |
pkg/scheduler/api/devices/config/config_test.go | Adds YAML parsing tests for the Nvidia config |
pkg/scheduler/api/devices/config/vgpu.go | Defines vgpu-specific structures and configurations |
pkg/scheduler/api/devices/util.go | Updates client initialization for devices |
pkg/scheduler/api/devices/nvidia/vgpu/device_info.go | Adjusts vgpu device initialization to incorporate unified config loading |
pkg/scheduler/api/devices/nvidia/vgpu/utils.go | Refactors utility functions, including a new geo extraction helper |
docs/user-guide/how_to_use_volcano_vgpu.md | Updates documentation to include configuration customization |
} | ||
data, ok := cm.Data["device-config.yaml"] | ||
if !ok { | ||
return nil, errors.New("data-config.yaml not found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message refers to 'data-config.yaml' while the code checks for 'device-config.yaml'. Please update the error message to match the actual key being used.
return nil, errors.New("data-config.yaml not found") | |
return nil, errors.New("device-config.yaml not found") |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adopted
Patch(context.Background(), node.Name, k8stypes.StrategicMergePatchType, bytes, metav1.PatchOptions{}) | ||
if err != nil { | ||
klog.Errorf("patch pod %v failed, %v", node.Name, err) | ||
} | ||
return err | ||
} | ||
|
||
func extractGeoMetriyFromType(t string) ([]config.Geometry, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Typo in the function name 'extractGeoMetriyFromType'; consider renaming it to 'extractGeometryFromType' for clarity and consistency.
func extractGeoMetriyFromType(t string) ([]config.Geometry, error) { | |
func extractGeometryFromType(t string) ([]config.Geometry, error) { |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adopted
@@ -131,6 +131,10 @@ curl {volcano device plugin pod ip}:9394/metrics | |||
``` | |||
 | |||
|
|||
### Customize | |||
|
|||
You can customize volcano-vgpu feature by modifying configMap **volcano-vgpu-device-config**, that configMap is automatically deployed when you setup volcano-vgpu-device-plugin. You can change **resourceCountName**, **resourceMemoryName**, **gpuMemoryFactor**, **MigTemplates**, and others related to volcano-vgpu by editing that configMap, for more information, refer to volcano-vgpu-device-plugin. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refer to volcano-vgpu-device-plugin
, maybe add a ref link here is better?
return nil, err | ||
} | ||
} | ||
data, ok := cm.Data["device-config.yaml"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract device-config.yaml
as a constant is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adopted
Type: items[3], | ||
PodMap: make(map[string]*GPUUsage), | ||
Health: health, | ||
Mode: "hami-core", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also better to extract these modes to constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, updated
} | ||
|
||
func InitDevicesConfig(cmName string) { | ||
once.Do(func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to confirm that we only initialize once here? And if the user modifies cm and want to reload the config, they can only restart the scheduler, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's for current version, we will add a informer to watch this configMap in future version.
Please also clean and squash your commit |
} | ||
|
||
// NewClient connects to an API server | ||
func NewClient() (*kubernetes.Clientset, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omitted this comment? @archlitchi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, i have already fixed
return &yamlData, nil | ||
} | ||
|
||
func InitDevicesConfig(cmName string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not initialize it in the devicplugin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are initialized in device plugin as well, the config needs to be loaded both to vc-scheduler and volcano-vgpu-device-plugin
|
||
const DeviceConfigFileName = "device-config.yaml" | ||
|
||
type Config struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this config only used in vgpu, if so, maybe put it in vgpu pkg is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are designed to support multiple heterogeneous devices, not only vgpu
limitations under the License. | ||
*/ | ||
|
||
package config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exported fields in this file should add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adopted
MigTemplate: []config.Geometry{}, | ||
MigUsage: config.MigInUse{}, | ||
} | ||
if len(items) > 5 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 is a magic number, we'd better explain why and add a sample data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 is the length of items before dynamic-mig feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a comment and give a sample data.
if err != nil { | ||
configs = &Config{ | ||
NvidiaConfig: NvidiaConfig{ | ||
ResourceCountName: "volcano.sh/vgpu-number", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are constant definition of vgpu-memory and vgpu-member, we'd better re-use them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adopted
return configs | ||
} | ||
|
||
func LoadConfigFromCM(kubeClient kubernetes.Interface, cmName string) (*Config, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be exported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adopted
6be54d3
to
61c02fa
Compare
Patch(context.Background(), node.Name, k8stypes.StrategicMergePatchType, bytes, metav1.PatchOptions{}) | ||
if err != nil { | ||
klog.Errorf("patch pod %v failed, %v", node.Name, err) | ||
} | ||
return err | ||
} | ||
|
||
func extractGeometryFromType(t string) ([]config.Geometry, error) { | ||
for _, val := range config.GetConfig().NvidiaConfig.MigGeometriesList { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should check config.GetConfig() not nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
4b08d28
to
86ea6e2
Compare
/approve |
return configs | ||
} | ||
|
||
func loadConfigFromCM(kubeClient kubernetes.Interface, cmName string) (*Config, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get ns by downward API.
} | ||
|
||
func loadConfigFromCM(kubeClient kubernetes.Interface, cmName string) (*Config, error) { | ||
cm, err := kubeClient.CoreV1().ConfigMaps("kube-system").Get(context.Background(), cmName, metav1.GetOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, why the device config cm is got from "kube-system" ns here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because this cm is installed by 'volcano-vgpu-device-plugin', which default ns is 'kube-system'
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Monokaix, william-wang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: limengxuan <391013634@qq.com>
/lgtm |
/kind feature
/area scheduling
After last Friday's discussion, we decided to use sync.once to initialize device config from configmap once during initializing. This Feature is used combined with latest version of 'https://github.com/Project-HAMi/volcano-vgpu-device-plugin'
This PR exposes vGPU resourceName as #3926 does, Using device--configMap is a better than using scheduler-configmap, since device-config can be accessed by volcano-vgpu-device-plugin.
To be noticed: