
Conversation

zhuangqh
Collaborator

Reason for Change:

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

related to #1297

Notes for Reviewers:


Title

Enhance GPU Config Retrieval Based on Preferred Nodes


Description

  • Refactored GPU configuration retrieval logic

  • Added getGPUConfig function to handle GPU config based on preferred nodes

  • Updated tests to cover new GPU config retrieval paths


Changes walkthrough 📝

Relevant files:

Enhancement

preset-inferences.go (pkg/workspace/inference/preset-inferences.go): Refactor GPU config retrieval and usage (+42/-24)

  • Introduced getGPUConfig function to determine GPU configuration
  • Modified GeneratePresetInference to use getGPUConfig
  • Updated GenerateInferencePodSpec to use GPU config directly
  • Adjusted checkIfNVMeAvailable to use GPU config pointer

Tests

preset-inferences_test.go (pkg/workspace/inference/preset-inferences_test.go): Add tests for GPU config retrieval (+173/-4)

  • Added TestGetGPUConfig to test GPU config retrieval logic
  • Updated TestGeneratePresetInference and TestGetDistributedInferenceProbe to use metav1.ObjectMeta


    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Possible Issue

    The getGPUConfig function dereferences gpuConfig (*gpuConfig) whenever TryGetGPUConfigFromNode returns a nil error. If that call returns (nil, nil), the dereference panics; only the SKU-lookup path guards against a nil pointer.

    // 1. try to get GPU config from known sku if instanceType is set
    if len(ctx.Workspace.Resource.PreferredNodes) == 0 {
    	gpuConfig, _ = utils.GetGPUConfigBySKU(ctx.Workspace.Resource.InstanceType)
    	if gpuConfig != nil {
    		return *gpuConfig
    	}
    }
    
    // 2. try to get GPU config from the node status
    gpuConfig, err = utils.TryGetGPUConfigFromNode(ctx.Ctx, ctx.KubeClient, ctx.Workspace.Status.WorkerNodes)
    if err == nil {
    	return *gpuConfig
    }
    
    // 3. if both above methods fail, use the default GPU count requirement from the model
    //    FIXME: assume gpu nodes are provided here. cpu inference should not go through this path.
    defaultNumGPU := resource.MustParse(ctx.Model.GetInferenceParameters().GPUCountRequirement)
    skuNumGPUs := int(defaultNumGPU.Value())
    return sku.GPUConfig{
    	GPUCount: skuNumGPUs,
    }
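A nil-safe version of this fallback chain can be sketched as follows. Note that GPUConfig, lookupBySKU, and lookupFromNodes are simplified stand-ins for sku.GPUConfig, utils.GetGPUConfigBySKU, and utils.TryGetGPUConfigFromNode, not the actual project types:

```go
package main

import "fmt"

// GPUConfig is a simplified stand-in for the sku.GPUConfig type in the PR.
type GPUConfig struct {
	SKU      string
	GPUCount int
}

// lookupBySKU and lookupFromNodes are hypothetical stubs standing in for
// utils.GetGPUConfigBySKU and utils.TryGetGPUConfigFromNode. Either may
// return (nil, nil), which is exactly the case the review flags.
func lookupBySKU(instanceType string) (*GPUConfig, error) { return nil, nil }
func lookupFromNodes() (*GPUConfig, error)                { return nil, nil }

// getGPUConfig mirrors the three-step fallback from the PR, but checks the
// pointer before every dereference so a (nil, nil) result cannot panic.
func getGPUConfig(preferredNodes []string, instanceType string, defaultGPUs int) GPUConfig {
	// 1. try the known-SKU lookup, only when no preferred nodes are set
	if len(preferredNodes) == 0 {
		if cfg, err := lookupBySKU(instanceType); err == nil && cfg != nil {
			return *cfg
		}
	}
	// 2. try the node-status lookup; the nil check guards against (nil, nil)
	if cfg, err := lookupFromNodes(); err == nil && cfg != nil {
		return *cfg
	}
	// 3. fall back to the model's default GPU count requirement
	return GPUConfig{GPUCount: defaultGPUs}
}

func main() {
	// Both stub lookups fail here, so the default count is returned.
	cfg := getGPUConfig(nil, "Standard_NC6", 1)
	fmt.Println(cfg.GPUCount)
}
```

Checking err == nil && cfg != nil at step 2 is the key change: a (nil, nil) return from the node lookup falls through to the default instead of panicking.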
    Redundant Code

    The getGPUConfig function includes a comment FIXME: assume gpu nodes are provided here. cpu inference should not go through this path. indicating potential redundant or incorrect logic for CPU inference.

    //    FIXME: assume gpu nodes are provided here. cpu inference should not go through this path.
    defaultNumGPU := resource.MustParse(ctx.Model.GetInferenceParameters().GPUCountRequirement)
    skuNumGPUs := int(defaultNumGPU.Value())
    return sku.GPUConfig{
    	GPUCount: skuNumGPUs,
    }


    PR Code Suggestions ✨

    Explore these optional code suggestions:

    Category: Possible issue

    Suggestion 1: Handle SKU config error

    Handle the error returned by utils.GetGPUConfigBySKU to avoid ignoring potential
    issues.

    pkg/workspace/inference/preset-inferences.go [212-214]

    -gpuConfig, _ = utils.GetGPUConfigBySKU(ctx.Workspace.Resource.InstanceType)
    +gpuConfig, err = utils.GetGPUConfigBySKU(ctx.Workspace.Resource.InstanceType)
    +if err != nil {
    +    // Log the error or handle it appropriately
    +    return sku.GPUConfig{}
    +}
     if gpuConfig != nil {
         return *gpuConfig
     }
    Suggestion importance[1-10]: 8


    Why: Handling the error from utils.GetGPUConfigBySKU prevents potential issues from being ignored, improving robustness.

    Impact: Medium

    Suggestion 2: Handle GPU count parsing error

    Ensure defaultNumGPU parsing does not panic and handle potential errors gracefully.

    pkg/workspace/inference/preset-inferences.go [226-227]

    -defaultNumGPU := resource.MustParse(ctx.Model.GetInferenceParameters().GPUCountRequirement)
    +defaultNumGPU, err := resource.ParseQuantity(ctx.Model.GetInferenceParameters().GPUCountRequirement)
    +if err != nil {
    +    // Log the error or handle it appropriately
    +    return sku.GPUConfig{}
    +}
     skuNumGPUs := int(defaultNumGPU.Value())
     return sku.GPUConfig{
         GPUCount: skuNumGPUs,
     }
    Suggestion importance[1-10]: 8


    Why: Ensuring defaultNumGPU parsing does not panic and handling potential errors gracefully improves the reliability of the code.

    Impact: Medium

    Suggestion 3: Check GPU config nil

    Ensure gpuConfig is not nil before dereferencing it.

    pkg/workspace/inference/preset-inferences.go [220-221]

     gpuConfig, err = utils.TryGetGPUConfigFromNode(ctx.Ctx, ctx.KubeClient, ctx.Workspace.Status.WorkerNodes)
    -if err == nil {
    +if err == nil && gpuConfig != nil {
         return *gpuConfig
     }
    Suggestion importance[1-10]: 8


    Why: Checking gpuConfig for nil before dereferencing it prevents potential nil pointer dereference errors, enhancing code safety.

    Impact: Medium
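The panic-free parsing pattern behind the "Handle GPU count parsing error" suggestion can be sketched with the standard library. Here strconv.Atoi stands in for resource.ParseQuantity (which lives in k8s.io/apimachinery and is not vendored in this sketch); the point is the shape of the error handling, not the exact parser:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseGPUCount mimics the suggested switch from resource.MustParse, which
// panics on malformed input, to an error-returning parse with a fallback.
func parseGPUCount(requirement string, fallback int) int {
	n, err := strconv.Atoi(requirement)
	if err != nil {
		// Malformed requirement: fall back instead of panicking.
		// In the real code this is where the error would be logged.
		return fallback
	}
	return n
}

func main() {
	fmt.Println(parseGPUCount("8", 1))         // well-formed input
	fmt.Println(parseGPUCount("not-a-qty", 1)) // malformed input falls back
}
```

Since GPUCountRequirement comes from model preset metadata rather than user input, MustParse rarely fires in practice, but the error-returning form makes the failure mode explicit rather than a crash.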

    zhuangqh added 2 commits July 24, 2025 14:36
    Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
    Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
    @zhuangqh zhuangqh merged commit 2163923 into kaito-project:main Jul 25, 2025
    12 checks passed