Description
Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Motivation
We're using a distributed file system to store LLM weights in a Kubernetes environment. As a typical design choice, the system is tuned for maximum parallelism, so it performs relatively poorly with single-threaded, sequential reads. Through benchmarking, we found that model loading can be up to 5 times faster when using 8 threads, compared to SGLang's current single-threaded loading.
We hope an option can be added to read the model weights in parallel. It is not very useful for users who store their weights on a local physical drive, but it could be life-saving for users with a distributed storage backend, including S3 (via S3FS). A rough illustration of the idea is sketched below.
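For illustration only, here is a minimal sketch of what parallel shard loading could look like. This is not SGLang's actual loader code; the function name `load_weights_parallel`, the directory path, and the thread count are hypothetical, and it simply reads safetensors shards concurrently with a thread pool:

```python
# Hypothetical sketch: read all *.safetensors shards of a model in parallel.
# On a distributed file system, concurrent reads can overlap network latency.
import glob
import os
from concurrent.futures import ThreadPoolExecutor

from safetensors.torch import load_file


def load_weights_parallel(model_dir: str, num_threads: int = 8) -> dict:
    """Read every safetensors shard in parallel and merge into one state dict."""
    shard_paths = sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))
    state_dict = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each worker loads one shard; results are merged on the main thread.
        for shard in pool.map(load_file, shard_paths):
            state_dict.update(shard)
    return state_dict


# Example usage (paths and thread count are illustrative):
# state_dict = load_weights_parallel("/mnt/dfs/llama-3-70b", num_threads=8)
```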
Related resources
vLLM uses Run:ai Model Streamer for streaming models concurrently to GPUs: https://docs.vllm.ai/en/stable/models/extensions/runai_model_streamer.html
Triton also supports loading models in parallel: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_management.html#concurrently-loading-models