Skip to content

Conversation

bangqipropel
Copy link
Collaborator

Reason for Change:

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

Copy link

Title

Support Preferred Nodes in RAG and Enhance GPU Configuration


Description

  • Added support for Preferred Nodes in RAG configuration

  • Enhanced GPU configuration logic based on Preferred Nodes

  • Updated tests to include scenarios with Preferred Nodes

  • Modified e2e workflow to create a node pool for RAG workloads


Changes walkthrough 📝

Relevant files
Enhancement
preset-rag.go
Enhanced GPU configuration logic                                                 

pkg/ragengine/controllers/preset-rag.go

  • Imported sku package
  • Modified GPU configuration logic to consider Preferred Nodes
  • +12/-6   
    Tests
    preset-rag_test.go
    Added GPU config tests                                                                     

    pkg/ragengine/controllers/preset-rag_test.go

  • Added new test cases for GPU configuration logic
  • Imported additional packages
  • +67/-0   
    testUtils.go
    Added mock for preferred nodes                                                     

    pkg/utils/test/testUtils.go

    • Added MockRAGEngineWithPresetPreferredCPUNodes for testing
    +27/-0   
    rag_test.go
    Updated and added tests for preferred nodes                           

    test/rage2e/rag_test.go

  • Updated pod creation functions to accept pod names
  • Added new test case for RAG with preferred nodes
  • Modified existing test cases to pass pod names
  • +144/-28
    Configuration changes
    ragengine-e2e-workflow.yml
    Added node pool creation in e2e workflow                                 

    .github/workflows/ragengine-e2e-workflow.yml

    • Added step to create node pool for RAG workloads
    +15/-0   

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • Copy link

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Error Handling

    The error returned by utils.GetGPUConfigBySKU is ignored. This could lead to unexpected behavior if GetGPUConfigBySKU fails.

    if len(ragEngineObj.Spec.Compute.PreferredNodes) == 0 {
    Sleep Call

    A time.Sleep(30 * time.Second) call is added, which can slow down the tests and make them less reliable. Consider using a more robust method to wait for resources to be ready.

    time.Sleep(30 * time.Second)

    Signed-off-by: Bangqi Zhu <bangqizhu@microsoft.com>
    Copy link

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Handle error from GPU config retrieval

    Handle the error returned by utils.GetGPUConfigBySKU to ensure proper error handling
    and logging.

    pkg/ragengine/controllers/preset-rag.go [122]

    -gpuConfig, _ = utils.GetGPUConfigBySKU(ragEngineObj.Spec.Compute.InstanceType)
    +gpuConfig, err = utils.GetGPUConfigBySKU(ragEngineObj.Spec.Compute.InstanceType)
    +if err != nil {
    +    // Log the error or handle it appropriately
    +    skuNumGPUs = 0
    +}
    Suggestion importance[1-10]: 8

    __

    Why: Proper error handling is crucial for robustness. Ignoring the error can lead to incorrect behavior if GPU configuration retrieval fails.

    Medium
    General
    Replace sleep with polling

    Replace time.Sleep with a more reliable mechanism for waiting, such as polling with
    a timeout.

    test/rage2e/rag_test.go [161]

    -time.Sleep(30 * time.Second)
    +// Example of polling with a timeout
    +ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    +defer cancel()
     
    +for ctx.Err() == nil {
    +    // Check condition
    +    if conditionMet {
    +        break
    +    }
    +    time.Sleep(1 * time.Second) // Adjust sleep duration as needed
    +}
    +
    +if ctx.Err() != nil {
    +    t.Errorf("Timed out waiting for condition")
    +}
    +
    Suggestion importance[1-10]: 7

    __

    Why: Polling with a timeout is more reliable than a fixed sleep, as it allows the test to proceed as soon as the condition is met, reducing unnecessary wait times.

    Medium
    Implement node lookup retry

    Consider retrying the node lookup or handling the case where no suitable node is
    found more gracefully.

    test/rage2e/rag_test.go [1016]

    -return "", fmt.Errorf("no node containing 'ragpool' found")
    +// Example of retrying the node lookup
    +attempts := 5
    +for i := 0; i < attempts; i++ {
    +    nodeList := &v1.NodeList{}
    +    err := utils.TestingCluster.KubeClient.List(ctx, nodeList)
    +    if err != nil {
    +        return "", fmt.Errorf("failed to list nodes: %v", err)
    +    }
     
    +    for _, node := range nodeList.Items {
    +        if strings.Contains(node.Name, "ragpool") {
    +            return node.Name, nil
    +        }
    +    }
    +    time.Sleep(5 * time.Second) // Adjust sleep duration as needed
    +}
    +return "", fmt.Errorf("no node containing 'ragpool' found after %d attempts", attempts)
    +
    Suggestion importance[1-10]: 7

    __

    Why: Retrying the node lookup improves reliability, especially in environments where node availability might be transient. This approach provides a better chance of finding the node within a reasonable timeframe.

    Medium
    Set default GPU count

    Ensure that skuNumGPUs is set to a default value that makes sense for the
    application logic.

    pkg/ragengine/controllers/preset-rag.go [130]

    -skuNumGPUs = 0
    +skuNumGPUs = 1 // or another appropriate default value
    Suggestion importance[1-10]: 5

    __

    Why: Setting a default value ensures that the application has a sensible fallback, but the specific value should be determined based on the application's requirements.

    Low

    @zhuangqh zhuangqh changed the title chore: Delete sleep chore: remove unnecessary sleep in test Jul 30, 2025
    @zhuangqh zhuangqh merged commit af43eb5 into kaito-project:main Jul 30, 2025
    13 of 14 checks passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    Status: Done
    Development

    Successfully merging this pull request may close these issues.

    2 participants