
[Feature] Load model weights in parallel #4822

@stevapple

Description


Motivation

We're using a distributed file system to store LLM weights in a Kubernetes environment. As a typical design choice, the system is tuned for maximum parallelism, which makes it perform relatively poorly under single-threaded, sequential reads. Through benchmarking, we found that model loading can be up to 5 times faster with 8 threads than with SGLang's current single-threaded loader.

We hope an option can be added to enable parallelism when reading model weights, as sketched below. It is not particularly useful for users who store their weights on a local physical drive, but it could be life-saving for users with a distributed storage backend, including S3 (via S3FS).
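To illustrate the idea, here is a minimal sketch of loading safetensors shards concurrently with a thread pool. This is not SGLang's loader; the function name `load_weights_parallel` and its parameters are hypothetical, and the sketch assumes the checkpoint is split across multiple `.safetensors` files so shard reads can overlap on a distributed file system.

```python
import glob
from concurrent.futures import ThreadPoolExecutor

from safetensors.torch import load_file


def load_weights_parallel(model_dir: str, num_threads: int = 8) -> dict:
    """Read all .safetensors shards in model_dir concurrently.

    Each worker thread reads one shard; on a distributed file system the
    reads overlap, which is where the speedup comes from.
    """
    shard_paths = sorted(glob.glob(f"{model_dir}/*.safetensors"))
    state_dict = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # load_file returns a {tensor_name: tensor} dict per shard
        for shard in pool.map(load_file, shard_paths):
            state_dict.update(shard)
    return state_dict
```

The number of threads would ideally be configurable (e.g. via a server argument), since the optimal degree of parallelism depends on the storage backend.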

Related resources

vLLM uses Run:ai Model Streamer for streaming models concurrently to GPUs: https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer.html

Triton also supports loading models in parallel: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_management.html#concurrently-loading-models
