
Basic Tutorial: Adding a New LLM Inference/Serving Backend #21

@PeterSH6

Description

  1. Prerequisite: Make sure the LLM inference framework can be launched in SPMD style. For example, the offline inference script can be launched with `torchrun --standalone --nproc_per_node=8 offline_inference.py`.
  2. A Rollout class: Build an `xxx_rollout.py` script similar to `vllm_rollout.py`. In this file, define an `xxxRollout` class that inherits from `BaseRollout`.
    1. This class should expose a `generate_sequences` API that accepts a batch of `input_ids`, `response_masks`, and `position_ids` from the `DataProto` as input. The `self.inference_engine` (e.g., `LLMEngine` in vLLM) performs auto-regressive generation and outputs a batch of responses. These responses should then be concatenated with `input_ids`, and the `response_masks` and `position_ids` should be extended accordingly to cover the generated tokens.
  3. ShardingManager classes for weight synchronization with training frameworks: Create files named `fsdp_xxx.py` and `megatron_xxx.py`, similar to `fsdp_vllm.py` and `megatron_vllm.py`. These files should define `XXXShardingManager` classes (i.e., the HybridEngine) that handle weight resharding between the training and inference frameworks.
    1. In `megatron_vllm.py`, we define an `AllGatherPPModel` class to collect weights across the pipeline-parallel dimension. The parameters stored in the `memory_buffers` of `AllGatherPPModel` are used to synchronize the weights with the models in the vLLM rollout.
  4. Weight loading: It may be necessary to provide model-specific weight loaders for transferring weights between each pair of LLM inference and training backends. This is similar to the `dtensor_weight_loader.py` and `megatron_weight_loader.py` files used with vLLM.
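The rollout flow in step 2 can be sketched as follows. This is a minimal, self-contained illustration, not the real verl code: `DataProto` is reduced to a plain batch dict, and the inference engine is a hypothetical stand-in with a `generate` method; the actual interfaces carry more fields and configuration.

```python
from dataclasses import dataclass, field

import torch


@dataclass
class DataProto:
    """Simplified stand-in for verl's DataProto batch container."""
    batch: dict = field(default_factory=dict)


class BaseRollout:
    def generate_sequences(self, prompts: DataProto) -> DataProto:
        raise NotImplementedError


class XXXRollout(BaseRollout):
    """Sketch of a rollout wrapping a hypothetical inference engine."""

    def __init__(self, inference_engine, pad_token_id: int = 0):
        self.inference_engine = inference_engine
        self.pad_token_id = pad_token_id

    def generate_sequences(self, prompts: DataProto) -> DataProto:
        input_ids = prompts.batch["input_ids"]            # (bsz, prompt_len)
        attention_mask = prompts.batch["attention_mask"]  # (bsz, prompt_len)
        position_ids = prompts.batch["position_ids"]      # (bsz, prompt_len)

        # The engine performs auto-regressive generation and returns
        # a batch of response token ids, shape (bsz, resp_len).
        responses = self.inference_engine.generate(input_ids)

        # Concatenate prompts with responses.
        seq = torch.cat([input_ids, responses], dim=-1)

        # Extend the mask and position ids to cover the generated tokens.
        resp_len = responses.size(1)
        response_mask = (responses != self.pad_token_id).long()
        attention_mask = torch.cat([attention_mask, response_mask], dim=-1)
        delta = torch.arange(1, resp_len + 1, device=position_ids.device)
        resp_pos = position_ids[:, -1:] + delta.unsqueeze(0)
        position_ids = torch.cat([position_ids, resp_pos], dim=-1)

        return DataProto(batch={
            "input_ids": seq,
            "responses": responses,
            "attention_mask": attention_mask,
            "position_ids": position_ids,
        })
```

The key invariant is that after generation, `input_ids`, `attention_mask`, and `position_ids` all share the same sequence length, so the training side can consume the batch directly.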
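A sharding manager (step 3) is typically used as a context manager around rollout: on entry it gathers the training-side weights and loads them into the inference engine, and on exit it frees the inference-side copies so training can reclaim memory. A minimal sketch, assuming hypothetical `load_weights`/`free_weights` engine hooks (the real resharding logic is framework-specific):

```python
class XXXShardingManager:
    """Sketch of a sharding manager bridging a training framework and an
    inference engine. Interfaces here are assumptions for illustration."""

    def __init__(self, train_module, inference_engine):
        self.train_module = train_module
        self.inference_engine = inference_engine

    def __enter__(self):
        # Gather the (possibly sharded) training weights into full tensors,
        # then push them into the inference engine before rollout begins.
        state_dict = self.train_module.state_dict()
        self.inference_engine.load_weights(state_dict)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Release the inference-side weight copies after rollout.
        self.inference_engine.free_weights()
```

Wrapping rollout in `with XXXShardingManager(...):` keeps weight synchronization and cleanup symmetric even when generation raises an exception.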
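A per-model weight loader (step 4) mostly renames training-side parameter keys to the inference model's layout and copies tensors in place. Below is a sketch under an assumed, purely illustrative naming scheme (`transformer.` → `model.`); real loaders such as those in `dtensor_weight_loader.py` also handle fused, transposed, or sharded parameter layouts per architecture.

```python
import torch


def xxx_weight_loader(actor_state_dict, inference_model):
    """Copy training-side weights into an inference model.

    The name map below is a hypothetical example; each model architecture
    needs its own mapping between training and inference parameter names.
    """
    name_map = {"transformer.": "model."}
    params = dict(inference_model.named_parameters())
    for name, tensor in actor_state_dict.items():
        # Rewrite the training-side prefix to the inference-side one.
        for src, dst in name_map.items():
            if name.startswith(src):
                name = dst + name[len(src):]
        if name in params:
            # Copy in place without tracking gradients.
            with torch.no_grad():
                params[name].copy_(tensor)
```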
