fix: fix bug where fetch GPU count was failing and defaulting #1338


Merged
merged 3 commits into from
Aug 5, 2025

Conversation

Jont828
Contributor

@Jont828 Jont828 commented Jul 30, 2025

Reason for Change:

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

Fixes #1335
Notes for Reviewers:


Title

Fix GPU count fetching and refactor node retrieval


Description

  • Fixed bug in fetching GPU count from nodes

  • Updated node retrieval method to use Get instead of List

  • Corrected error handling and return statements

  • Refactored variable names and imports for consistency


Changes walkthrough 📝

Relevant files

Bug fix
common.go: Refactor node retrieval and fix GPU count fetching

pkg/utils/common.go (+12/-17)

  • Changed v1.NodeList to corev1.NodeList
  • Replaced List with Get for node retrieval
  • Updated error handling and return statements
  • Refactored variable names and imports

Refactoring
common_test.go: Simplify test setup

pkg/utils/common_test.go (+0/-7)

  • Removed unused import for client
  • Simplified fake client setup by removing indexer


    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Incorrect GPU Count Aggregation

    The function GetPerNodeGPUCountFromNodes currently returns the GPU count of the first node that has GPUs. It should aggregate the GPU counts from all nodes.

    func GetPerNodeGPUCountFromNodes(nodeList *corev1.NodeList) int {
    	for _, node := range nodeList.Items {
    		gpuCount, exists := node.Status.Capacity[consts.NvidiaGPU]
    		if exists && gpuCount.String() != "" {
    			return int(gpuCount.Value())
    		}
    	}
    	// ... (the rest of the function is not shown in the review excerpt)
    Error Handling

    The function FetchGPUCountFromNodes returns an error immediately when it fails to get a node. It should continue processing other nodes and return the count of successfully fetched nodes.

    if err := kubeClient.Get(ctx, types.NamespacedName{Name: nodeName}, node); err != nil { // Note: nodes don't have a namespace here.
    	return 0, fmt.Errorf("failed to get node %s: %w", nodeName, err)
    }
    allNodes.Items = append(allNodes.Items, *node)

    Collaborator

    @Fei-Guo Fei-Guo left a comment

    Nice catch. Thanks for the fix.

    @chewong changed the title from "bug: fix bug where fetch GPU count was failing and defaulting" to "fix: fix bug where fetch GPU count was failing and defaulting" on Jul 31, 2025
    Collaborator

    chewong commented Jul 31, 2025

    @Jont828 could you resolve the unit test failure?

    @Fei-Guo Fei-Guo merged commit 8f4fa75 into kaito-project:main Aug 5, 2025
    14 checks passed
    Projects
    Status: Done
    Development

    Successfully merging this pull request may close these issues.

    Incorrect GPU resource request when preferred nodes are used
    4 participants