
Conversation

zhuangqh
Collaborator

@zhuangqh zhuangqh commented Aug 6, 2025

Reason for Change:

  • bump base image to 0.0.5
    • update basic py lib
    • add chat template for deepseek series model (see the sketch below)
  • migrate preset test to A10
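
As a hedged illustration, the new template would presumably be wired in the same way as the existing falcon/mistral presets quoted later on this page, via vLLM's --chat-template flag (the deepseek template path here is hypothetical):

    vllm:
      command: "python3 /workspace/vllm/inference_api.py --dtype float16 --chat-template /workspace/chat_templates/deepseek-r1.jinja"  # template path is hypothetical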


Signed-off-by: zhuangqh <zhuangqhc@gmail.com>

kaito-pr-agent bot commented Aug 6, 2025

Title

Update base image and VM sizes for A10 series


Description

  • Updated base image to version 0.0.5

  • Changed node VM size to A10 series across multiple configurations

  • Reduced max-parallel jobs in e2e-preset-test workflow

  • Added max-model-len parameter in inference-tmpl manifest (see the sketch below)
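
For context, a minimal sketch of where the new parameter sits, assuming the inference-tmpl template forwards max-model-len to vLLM (only max-model-len: 2048 is from the diff; the surrounding keys are illustrative, not the actual manifest schema):

    inference:                # illustrative structure, not the real schema
      vllm:
        max-model-len: 2048   # caps prompt + generated tokens per request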


Changes walkthrough 📝

Relevant files

Enhancement

e2e-preset-configs.json: Update VM sizes to A10 series
.github/e2e-preset-configs.json
  • Updated node-vm-size to Standard_NV36ads_A10_v5 for various models
  • Updated node-vm-size to Standard_NV72ads_A10_v5 for qwen2.5-coder-7b-instruct
  +14/-14

manifest.yaml: Add max-model-len parameter
presets/workspace/test/manifests/inference-tmpl/manifest.yaml
  • Added max-model-len parameter with value 2048
  +1/-0

Configuration changes

e2e-preset-test.yaml: Reduce parallel job limit
.github/workflows/e2e-preset-test.yaml
  • Reduced max-parallel from 10 to 5
  +1/-1

supported_models.yaml: Update base image tag
presets/workspace/models/supported_models.yaml
  • Updated tag from 0.0.4 to 0.0.5
  +2/-1

Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.

kaito-pr-agent bot commented Aug 6, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Configuration Update

Ensure that the new VM size Standard_NV36ads_A10_v5 is compatible with the current infrastructure and that it provides the necessary resources for the workloads.

      "node-vm-size": "Standard_NV36ads_A10_v5",
      "node-osdisk-size": 100,
      "OSS": true,
      "loads_adapter": false,
      "node_pool": "falcon7b",
      "runtimes": {
        "hf": {
          "command": "accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all /workspace/tfs/inference_api.py --pipeline text-generation --torch_dtype bfloat16",
          "gpu_count": 1
        },
        "vllm": {
          "command": "python3 /workspace/vllm/inference_api.py --kaito-max-probe-steps 6 --dtype float16 --chat-template /workspace/chat_templates/falcon-instruct.jinja",
          "gpu_count": 1
        }
      }
    },
    {
      "name": "falcon-7b-instruct",
      "node-count": 1,
      "node-vm-size": "Standard_NV36ads_A10_v5",
      "node-osdisk-size": 100,
      "OSS": true,
      "loads_adapter": false,
      "node_pool": "falcon7binst",
      "runtimes": {
        "hf": {
          "command": "accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all /workspace/tfs/inference_api.py --pipeline text-generation --torch_dtype bfloat16",
          "gpu_count": 1
        },
        "vllm": {
          "command": "python3 /workspace/vllm/inference_api.py --kaito-max-probe-steps 6 --dtype float16 --chat-template /workspace/chat_templates/falcon-instruct.jinja",
          "gpu_count": 1
        }
      }
    },
    {
      "name": "falcon-40b",
      "node-count": 1,
      "node-vm-size": "Standard_NC48ads_A100_v4",
      "node-osdisk-size": 400,
      "OSS": true,
      "loads_adapter": false,
      "node_pool": "falcon40b",
      "runtimes": {
        "hf": {
          "command": "accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all /workspace/tfs/inference_api.py --pipeline text-generation --torch_dtype bfloat16",
          "gpu_count": 2
        },
        "vllm": {
          "command": "python3 /workspace/vllm/inference_api.py --dtype bfloat16 --chat-template /workspace/chat_templates/falcon-instruct.jinja --tensor-parallel-size 2",
          "gpu_count": 2
        }
      }
    },
    {
      "name": "falcon-40b-instruct",
      "node-count": 1,
      "node-vm-size": "Standard_NC48ads_A100_v4",
      "node-osdisk-size": 400,
      "OSS": true,
      "loads_adapter": false,
      "node_pool": "falcon40bins",
      "runtimes": {
        "hf": {
          "command": "accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all /workspace/tfs/inference_api.py --pipeline text-generation --torch_dtype bfloat16",
          "gpu_count": 2
        },
        "vllm": {
          "command": "python3 /workspace/vllm/inference_api.py --dtype bfloat16 --chat-template /workspace/chat_templates/falcon-instruct.jinja --tensor-parallel-size 2",
          "gpu_count": 2
        }
      }
    },
    {
      "name": "mistral-7b",
      "node-count": 1,
      "node-vm-size": "Standard_NV36ads_A10_v5",
      "node-osdisk-size": 100,
      "OSS": true,
      "loads_adapter": false,
      "node_pool": "mistral7b",
      "runtimes": {
        "hf": {
          "command": "accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all /workspace/tfs/inference_api.py --pipeline text-generation --torch_dtype bfloat16",
          "gpu_count": 1
        },
        "vllm": {
          "command": "python3 /workspace/vllm/inference_api.py --kaito-max-probe-steps 6 --dtype float16 --chat-template /workspace/chat_templates/mistral-instruct.jinja",
          "gpu_count": 1
        }
      }
    },
    {
      "name": "mistral-7b-instruct",
      "node-count": 1,
      "node-vm-size": "Standard_NV36ads_A10_v5",
      "node-osdisk-size": 100,
      "OSS": true,
      "loads_adapter": false,
      "node_pool": "mistral7bins",
      "runtimes": {
        "hf": {
          "command": "accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all /workspace/tfs/inference_api.py --pipeline text-generation --torch_dtype bfloat16",
          "gpu_count": 1
        },
        "vllm": {
          "command": "python3 /workspace/vllm/inference_api.py --kaito-max-probe-steps 6 --dtype float16",
          "gpu_count": 1
        }
      }
    },
    {
      "name": "phi-2",
      "node-count": 1,
      "node-vm-size": "Standard_NV36ads_A10_v5",
      "node-osdisk-size": 50,
      "OSS": true,
      "loads_adapter": false,
      "node_pool": "phi2",
      "runtimes": {
        "hf": {
          "command": "accelerate launch --num_processes 1 --num_machines 1 --machine_rank 0 --gpu_ids all /workspace/tfs/inference_api.py --pipeline text-generation --torch_dtype bfloat16",
          "gpu_count": 1
        },
        "vllm": {
          "command": "python3 /workspace/vllm/inference_api.py --kaito-max-probe-steps 6 --dtype float16",
          "gpu_count": 1
        }
      }
    },
    {
      "name": "phi-3-mini-4k-instruct",
      "node-count": 1,
      "node-vm-size": "Standard_NV36ads_A10_v5",
      "node-osdisk-size": 50,
      "OSS": true,
      "loads_adapter": false,
      "node_pool": "phi3mini4kin",
      "runtimes": {
        "hf": {
Parallel Execution

Verify that reducing max-parallel from 10 to 5 does not negatively impact the test execution time or resource utilization.

    max-parallel: 5
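
For context, a minimal sketch of the surrounding workflow block, assuming the usual GitHub Actions matrix layout (only max-parallel: 5 is from the diff; the other keys are illustrative):

    strategy:
      fail-fast: false                                      # illustrative
      max-parallel: 5                                       # reduced from 10 in this PR
      matrix:
        model: ${{ fromJson(needs.setup.outputs.models) }}  # hypothetical matrix source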
Version Update

Confirm that the version bump to 0.0.5 includes all necessary changes and that the chat-template updates do not break existing functionality.

    tag: 0.0.5
    # Tag history:
    # 0.0.5 - update chat-template for new models
    # 0.0.4 - bump vLLM to 0.9.0.
    # 0.0.3 - add dependencies to support vLLM multi-node distributed inference
    # 0.0.2 - bump vLLM to 0.8.5. Tool chat template added.
    # 0.0.1 - Initial Release
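
For reference, a hedged sketch of how a model entry might pin this tag in supported_models.yaml (only tag: 0.0.5 and the history comments above are from the diff; the entry layout and model name are assumptions):

    models:
      - name: falcon-7b-instruct   # illustrative entry
        tag: 0.0.5                 # shared base-image tag bumped in this PR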

@zhuangqh
Collaborator Author

zhuangqh commented Aug 6, 2025

This PR needs to read a secret from the repo to push the image, so I need to push the branch to the main repo instead of a fork. @chewong
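
For background, a minimal sketch of why a fork will not work here: GitHub does not expose repository secrets to pull_request-triggered workflows from forks, so a push step like the one below only authenticates when the branch lives in the main repo (step and secret names are hypothetical):

    - name: Log in and push base image    # hypothetical step
      run: |
        echo "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USER" --password-stdin
        docker push "$REGISTRY/kaito-base:0.0.5"
      env:
        REGISTRY: ${{ secrets.REGISTRY }}              # secrets resolve empty on fork PRs
        REGISTRY_USER: ${{ secrets.REGISTRY_USER }}
        REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}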

kaito-pr-agent bot commented Aug 6, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General

Validate VM size compatibility

Verify that Standard_NV72ads_A10_v5 is compatible with the workload requirements and cost considerations.

.github/e2e-preset-configs.json [159]

    +"node-vm-size": "Standard_NV72ads_A10_v5",

Suggestion importance [1-10]: 6

Why: The suggestion asks to verify the compatibility of the VM size with workload requirements and cost considerations, which is important but not critical. The existing code and improved code are identical, so the score is capped at 6.

Impact: Low

Assess parallel job impact

Ensure that reducing max-parallel to 5 does not impact the testing throughput and resource utilization.

.github/e2e-preset-configs.json [131]

    +"max-parallel": 5,

Suggestion importance [1-10]: 6

Why: The suggestion asks to ensure that reducing max-parallel to 5 does not impact testing throughput and resource utilization, which is important but not critical. The existing code and improved code are identical, so the score is capped at 6.

Impact: Low

Verify tag updates

Confirm that the new tag 0.0.5 includes all necessary updates and is backward compatible.

presets/workspace/models/supported_models.yaml [6]

    +tag: 0.0.5

Suggestion importance [1-10]: 6

Why: The suggestion asks to confirm that the new tag 0.0.5 includes all necessary updates and is backward compatible, which is important but not critical. The existing code and improved code are identical, so the score is capped at 6.

Impact: Low

Check model length setting

Ensure that setting max-model-len to 2048 meets the model's requirements and does not cause truncation issues.

presets/workspace/test/manifests/inference-tmpl/manifest.yaml [97]

    +max-model-len: 2048

Suggestion importance [1-10]: 6

Why: The suggestion asks to ensure that setting max-model-len to 2048 meets the model's requirements and does not cause truncation issues, which is important but not critical. The existing code and improved code are identical, so the score is capped at 6.

Impact: Low
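
To make the truncation concern concrete, here is a hedged sketch of the effective vLLM launch if the template forwards the value (--max-model-len is a real vLLM flag; the pass-through via inference_api.py is an assumption):

    vllm:
      command: "python3 /workspace/vllm/inference_api.py --kaito-max-probe-steps 6 --dtype float16 --max-model-len 2048"
      # 2048 caps prompt + generated tokens per request, so prompts near the
      # cap leave little headroom for generation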

codecov bot commented Aug 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

    @@           Coverage Diff           @@
    ##             main    #1364   +/-   ##
    =======================================
      Coverage   58.39%   58.39%           
    =======================================
      Files          76       76           
      Lines        8720     8720           
    =======================================
      Hits         5092     5092           
      Misses       3406     3406           
      Partials      222      222           
    Component     Coverage Δ
    workspace     46.85% <ø> (ø)
    presets       87.84% <ø> (ø)
    main          ∅ <ø> (∅)

Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
@zhuangqh zhuangqh merged commit 4ad29f8 into main Aug 7, 2025
18 of 23 checks passed
@zhuangqh zhuangqh deleted the bump-img-005 branch August 7, 2025 05:51
farooqameen pushed a commit to farooqameen/kaito that referenced this pull request Aug 7, 2025

**Reason for Change**:

- bump base image to 0.0.5
  - update basic py lib
  - add chat template for deepseek series model
- migrate preset test to A10

---------

Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
Co-authored-by: Fei Guo <vrgf2003@gmail.com>