
Conversation

zhuangqh
Collaborator

Reason for Change:

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

related to #1297

Notes for Reviewers:


Title

Enhance GPU Config Retrieval Based on Preferred Nodes


Description

  • Refactored GPU configuration retrieval logic

  • Added getGPUConfig function to handle GPU config based on preferred nodes

  • Updated tests to cover new GPU config retrieval paths


Changes walkthrough 📝

Relevant files:

Enhancement

preset-inferences.go (pkg/workspace/inference/preset-inferences.go): Refactor GPU config retrieval and usage (+42/-24)

  • Introduced getGPUConfig function to determine GPU configuration
  • Modified GeneratePresetInference to use getGPUConfig
  • Updated GenerateInferencePodSpec to use GPU config directly
  • Adjusted checkIfNVMeAvailable to use GPU config pointer

Tests

preset-inferences_test.go (pkg/workspace/inference/preset-inferences_test.go): Add tests for GPU config retrieval (+173/-4)

  • Added TestGetGPUConfig to test GPU config retrieval logic
  • Updated TestGeneratePresetInference and TestGetDistributedInferenceProbe to use metav1.ObjectMeta


    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Possible Issue

    The getGPUConfig function dereferences gpuConfig (*gpuConfig) whenever TryGetGPUConfigFromNode returns a nil error. If that call returns (nil, nil), the dereference panics; only the SKU-lookup path guards against a nil pointer.

    // 1. try to get GPU config from known sku if instanceType is set
    if len(ctx.Workspace.Resource.PreferredNodes) == 0 {
    	gpuConfig, _ = utils.GetGPUConfigBySKU(ctx.Workspace.Resource.InstanceType)
    	if gpuConfig != nil {
    		return *gpuConfig
    	}
    }
    
    // 2. try to get GPU config from the node status
    gpuConfig, err = utils.TryGetGPUConfigFromNode(ctx.Ctx, ctx.KubeClient, ctx.Workspace.Status.WorkerNodes)
    if err == nil {
    	return *gpuConfig
    }
    
    // 3. if both above methods fail, use the default GPU count requirement from the model
    //    FIXME: assume gpu nodes are provided here. cpu inference should not go through this path.
    defaultNumGPU := resource.MustParse(ctx.Model.GetInferenceParameters().GPUCountRequirement)
    skuNumGPUs := int(defaultNumGPU.Value())
    return sku.GPUConfig{
    	GPUCount: skuNumGPUs,
    }
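A nil-safe version of this fallback chain can be sketched as follows. Note that GPUConfig, lookupBySKU, and lookupFromNodes are simplified stand-ins for sku.GPUConfig, utils.GetGPUConfigBySKU, and utils.TryGetGPUConfigFromNode, not the actual project types:

```go
package main

import "fmt"

// GPUConfig is a simplified stand-in for the sku.GPUConfig type in the PR.
type GPUConfig struct {
	SKU      string
	GPUCount int
}

// lookupBySKU and lookupFromNodes are hypothetical stubs standing in for
// utils.GetGPUConfigBySKU and utils.TryGetGPUConfigFromNode. Either may
// return (nil, nil), which is exactly the case the review flags.
func lookupBySKU(instanceType string) (*GPUConfig, error) { return nil, nil }
func lookupFromNodes() (*GPUConfig, error)                { return nil, nil }

// getGPUConfig mirrors the three-step fallback from the PR, but checks the
// pointer before every dereference so a (nil, nil) result cannot panic.
func getGPUConfig(preferredNodes []string, instanceType string, defaultGPUs int) GPUConfig {
	// 1. try the known-SKU lookup, only when no preferred nodes are set
	if len(preferredNodes) == 0 {
		if cfg, err := lookupBySKU(instanceType); err == nil && cfg != nil {
			return *cfg
		}
	}
	// 2. try the node-status lookup; the nil check guards against (nil, nil)
	if cfg, err := lookupFromNodes(); err == nil && cfg != nil {
		return *cfg
	}
	// 3. fall back to the model's default GPU count requirement
	return GPUConfig{GPUCount: defaultGPUs}
}

func main() {
	// Both stub lookups fail here, so the default count is returned.
	cfg := getGPUConfig(nil, "Standard_NC6", 1)
	fmt.Println(cfg.GPUCount)
}
```

Checking err == nil && cfg != nil at step 2 is the key change: a (nil, nil) return from the node lookup falls through to the default instead of panicking.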
    Redundant Code

    The getGPUConfig function includes a comment FIXME: assume gpu nodes are provided here. cpu inference should not go through this path. indicating potential redundant or incorrect logic for CPU inference.

    //    FIXME: assume gpu nodes are provided here. cpu inference should not go through this path.
    defaultNumGPU := resource.MustParse(ctx.Model.GetInferenceParameters().GPUCountRequirement)
    skuNumGPUs := int(defaultNumGPU.Value())
    return sku.GPUConfig{
    	GPUCount: skuNumGPUs,
    }


    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Category: Possible issue

    Suggestion 1: Handle SKU config error

    Handle the error returned by utils.GetGPUConfigBySKU to avoid ignoring potential
    issues.

    pkg/workspace/inference/preset-inferences.go [212-214]

    -gpuConfig, _ = utils.GetGPUConfigBySKU(ctx.Workspace.Resource.InstanceType)
    +gpuConfig, err = utils.GetGPUConfigBySKU(ctx.Workspace.Resource.InstanceType)
    +if err != nil {
    +    // Log the error or handle it appropriately
    +    return sku.GPUConfig{}
    +}
     if gpuConfig != nil {
         return *gpuConfig
     }
    Suggestion importance[1-10]: 8


    Why: Handling the error from utils.GetGPUConfigBySKU prevents potential issues from being ignored, improving robustness.

    Impact: Medium

    Suggestion 2: Handle GPU count parsing error

    Ensure defaultNumGPU parsing does not panic and handle potential errors gracefully.

    pkg/workspace/inference/preset-inferences.go [226-227]

    -defaultNumGPU := resource.MustParse(ctx.Model.GetInferenceParameters().GPUCountRequirement)
    +defaultNumGPU, err := resource.ParseQuantity(ctx.Model.GetInferenceParameters().GPUCountRequirement)
    +if err != nil {
    +    // Log the error or handle it appropriately
    +    return sku.GPUConfig{}
    +}
     skuNumGPUs := int(defaultNumGPU.Value())
     return sku.GPUConfig{
         GPUCount: skuNumGPUs,
     }
    Suggestion importance[1-10]: 8


    Why: Ensuring defaultNumGPU parsing does not panic and handling potential errors gracefully improves the reliability of the code.

    Impact: Medium

    Suggestion 3: Check GPU config nil

    Ensure gpuConfig is not nil before dereferencing it.

    pkg/workspace/inference/preset-inferences.go [220-221]

     gpuConfig, err = utils.TryGetGPUConfigFromNode(ctx.Ctx, ctx.KubeClient, ctx.Workspace.Status.WorkerNodes)
    -if err == nil {
    +if err == nil && gpuConfig != nil {
         return *gpuConfig
     }
    Suggestion importance[1-10]: 8


    Why: Checking gpuConfig for nil before dereferencing it prevents potential nil pointer dereference errors, enhancing code safety.

    Impact: Medium
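The panic-free parsing pattern behind the "Handle GPU count parsing error" suggestion can be sketched with the standard library. Here strconv.Atoi stands in for resource.ParseQuantity (which lives in k8s.io/apimachinery and is not vendored in this sketch); the point is the shape of the error handling, not the exact parser:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseGPUCount mimics the suggested switch from resource.MustParse, which
// panics on malformed input, to an error-returning parse with a fallback.
func parseGPUCount(requirement string, fallback int) int {
	n, err := strconv.Atoi(requirement)
	if err != nil {
		// Malformed requirement: fall back instead of panicking.
		// In the real code this is where the error would be logged.
		return fallback
	}
	return n
}

func main() {
	fmt.Println(parseGPUCount("8", 1))         // well-formed input
	fmt.Println(parseGPUCount("not-a-qty", 1)) // malformed input falls back
}
```

Since GPUCountRequirement comes from model preset metadata rather than user input, MustParse rarely fires in practice, but the error-returning form makes the failure mode explicit rather than a crash.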

    zhuangqh added 2 commits July 24, 2025 14:36
    Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
    Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
    @zhuangqh zhuangqh merged commit 2163923 into kaito-project:main Jul 25, 2025
    12 checks passed