Description
Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Motivation
We're using a distributed file system to store LLM weights in a Kubernetes environment. As a typical design choice, the system is tuned for maximum parallelism, so it performs relatively poorly with single-threaded, sequential reads. Through benchmarking, we found that model loading can be up to 5 times faster when using 8 threads, compared to SGLang's current single-threaded loading.
We hope an option can be added to read the model weights in parallel. It is not very useful for users who store their weights on a local physical drive, but it could be life-saving for users with a distributed storage backend, including S3 (via S3FS). A rough illustration of the idea is sketched below.
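For illustration only, here is a minimal sketch of what parallel shard loading could look like. This is not SGLang's actual loader code; the function name `load_weights_parallel`, the directory path, and the thread count are hypothetical, and it simply reads safetensors shards concurrently with a thread pool:

```python
# Hypothetical sketch: read all *.safetensors shards of a model in parallel.
# On a distributed file system, concurrent reads can overlap network latency.
import glob
import os
from concurrent.futures import ThreadPoolExecutor

from safetensors.torch import load_file


def load_weights_parallel(model_dir: str, num_threads: int = 8) -> dict:
    """Read every safetensors shard in parallel and merge into one state dict."""
    shard_paths = sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))
    state_dict = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each worker loads one shard; results are merged on the main thread.
        for shard in pool.map(load_file, shard_paths):
            state_dict.update(shard)
    return state_dict


# Example usage (paths and thread count are illustrative):
# state_dict = load_weights_parallel("/mnt/dfs/llama-3-70b", num_threads=8)
```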
Related resources
vLLM uses Run:ai Model Streamer for streaming models concurrently to GPUs: https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer.html
Triton also supports loading models in parallel: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_management.html#concurrently-loading-models