Add single lora adapter support for vLLM inference. #1679
Merged
Motivation
When evaluating SFT/DPO-trained models, vLLM-accelerated inference with a single LoRA adapter is often needed, but the current code does not support it. This PR adds a few lines of code to enable that functionality.
Modification
Added a lora_path argument to the VLLM model in opencompass/models/vllm.py and used it during generate, as sketched below.
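Roughly, the idea is to pass an optional LoRARequest to vLLM's generate call whenever lora_path is set. The following is a minimal standalone sketch, not the exact OpenCompass class code; the adapter name 'sft_adapter' is just an illustrative label.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# The engine must be created with enable_lora=True for adapters to load.
llm = LLM(model='llama-3-8b-instruct', enable_lora=True,
          tensor_parallel_size=2, dtype='bfloat16', max_model_len=4096)

lora_path = 'Llama3_8B_LoRA_checkpoints/checkpoint-1250/'
# Build a LoRARequest only when a lora_path is configured; otherwise pass None
# so the base model is used unchanged.
lora_request = LoRARequest('sft_adapter', 1, lora_path) if lora_path else None

sampling_params = SamplingParams(temperature=0.0, top_p=0.8, max_tokens=1024)
outputs = llm.generate(['Hello, my name is'], sampling_params,
                       lora_request=lora_request)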
Use cases (Optional)
With this change, vLLM inference with a LoRA adapter can be configured as shown below.
models = [
    dict(
        type=VLLM,
        abbr='Llama3_8B_LoRA_SFT',
        path='llama-3-8b-instruct',
        model_kwargs=dict(tensor_parallel_size=2, dtype='bfloat16', seed=0,
                          max_model_len=4096, enable_lora=True),
        max_out_len=100,
        max_seq_len=4096,
        batch_size=32,
        lora_path='Llama3_8B_LoRA_checkpoints/checkpoint-1250/',
        generation_kwargs=dict(temperature=0.0, top_p=0.8, max_tokens=1024),
        stop_words=['<|end_of_text|>', '<|eot_id|>'],
        run_cfg=dict(num_gpus=2),
    )
]
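Note that enable_lora=True must be set in model_kwargs so the underlying vLLM engine is initialized with LoRA support; without it, the lora_path setting would have no effect.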