[Feature]: Support Ray-free multi-node distributed inference on resource managers like Kubernetes #3902

@Jeffwan

Description

🚀 The feature, motivation and pitch

Currently, distributed inference (tensor parallelism) in vLLM relies on Ray to orchestrate the GPU workers. I briefly checked the code, and it seems the core distributed communication is provided by torch.distributed with the NCCL backend; the actors do not communicate over Ray's own protocol. In that case, Ray only plays the role of orchestration and resource reservation (placement groups). Please correct me if I am wrong.
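If that reading is correct, the Ray dependency is only about process placement; the hot path is plain torch.distributed. Here is a minimal sketch of what each TP worker needs at init time (hypothetical code, not vLLM's actual worker; the environment variables are the standard torch.distributed rendezvous variables, which any launcher can set):

```python
# Hypothetical sketch: the collective layer that TP workers rely on is plain
# torch.distributed with NCCL. Ray only decides where each worker process
# runs; tensor traffic goes over NCCL once a rendezvous point is known.
import os

import torch
import torch.distributed as dist

def init_worker_group() -> None:
    # Any orchestrator (Ray, Kubernetes, plain SSH) can supply these four
    # values; nothing here is Ray-specific.
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
```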

We use Ray and KubeRay on Kubernetes, and I have successfully tested vLLM distributed inference on this setup, confirming that it works. However, we serve many users and platforms, and we do not want to lock them into Ray, since some teams may not have enough Ray expertise to operate it. My proposal is to provide a simple orchestration layer on top of GPUExecutor for users who are familiar with cloud-native technologies: they would rely on Kubernetes for orchestration (in place of Ray actors) and scheduling (in place of placement groups). A sketch of how the rendezvous could work follows.
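To make the idea concrete, here is a hypothetical sketch of how a Kubernetes-native launcher could derive the distributed-init parameters that a Ray placement group currently supplies. The POD_NAME variable (injected via the Kubernetes downward API) and the StatefulSet/headless-service layout are assumptions for illustration, not an existing vLLM API. Each pod would run one GPU worker, with pod ordinal 0 acting as the rendezvous master:

```python
# Hypothetical sketch, assuming one GPU worker per pod in a StatefulSet
# fronted by a headless Service of the same name.
import os

def k8s_rendezvous_env(service_name: str, world_size: int, port: int = 29500) -> dict:
    pod_name = os.environ["POD_NAME"]            # injected via the downward API
    rank = int(pod_name.rsplit("-", 1)[-1])      # StatefulSet ordinal -> rank
    master = f"{service_name}-0.{service_name}"  # stable DNS name of pod 0
    return {
        "MASTER_ADDR": master,
        "MASTER_PORT": str(port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
        "LOCAL_RANK": "0",  # single GPU per pod in this sketch
    }
```

With something like this, the existing GPUExecutor workers could be started directly inside pods, and Kubernetes scheduling features (node selectors, topology constraints) would take over what placement groups do today.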

Ideally, vLLM would support both Ray and Kubernetes as orchestrators, giving our platform users alternatives that fit their needs.

Please check whether this proposal makes sense; I am happy to contribute this feature.

Alternatives

No response

Additional context

No response

Metadata

Assignees

No one assigned

    Labels

    feature request (New feature or request), stale (Over 90 days of inactivity)
