Optimize vllm num_generations #2855


Merged: 4 commits into main on Feb 18, 2025

Conversation

edbeeching
Collaborator

What does this PR do?

Adds an optimization to vLLM batching by setting n=num_generations in the vLLM SamplingParams.

@qgallouedec I looked at some examples in the debugger, and the ordering of the gathered prompts always seems to be [num_generations] * [num_prompts], but I am concerned there could be edge cases where this does not hold. What do you think?
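For context, a minimal sketch of the idea being discussed: ask vLLM for n completions per unique prompt instead of duplicating each prompt num_generations times. The model name, prompts, and values below are placeholders, not the trainer's actual configuration.

```python
from vllm import LLM, SamplingParams

# Hypothetical values for illustration; in TRL these come from the GRPO config.
num_generations = 4
unique_prompts = ["Write a haiku about GPUs.", "Explain KV caching in one sentence."]

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # any small model works for a smoke test

# Request n completions per prompt instead of repeating each prompt n times.
sampling_params = SamplingParams(n=num_generations, temperature=1.0, max_tokens=128)
outputs = llm.generate(unique_prompts, sampling_params)

# Flatten back to the [num_prompts * num_generations] ordering the trainer expects.
completions = [gen.text for request in outputs for gen in request.outputs]
assert len(completions) == len(unique_prompts) * num_generations
```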

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Member

It should always be the case, indeed.
The built-in set isn't ordered, right?
Is vLLM faster when you pass 1 prompt with n=N outputs than when you pass the same prompt N times with n=1?

@edbeeching
Collaborator Author

> It should always be the case indeed. The built-in set isn't ordered right? Is vllm faster when you pass 1 prompt with n=N outputs, than with N times the same prompt for n=1?

Unfortunately, set is not ordered. Yes, vLLM can share the prefill across the n generations, so it is faster: I profiled roughly a 1.5x speedup with the changes in this PR at 2k max_completion_length.
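The ~1.5x figure comes from the author's profiling; the sketch below is just a rough way to run a similar comparison yourself, with an illustrative model and completion length rather than the exact setup used here.

```python
import time
from vllm import LLM, SamplingParams

N = 8
prompt = "Summarize the benefits of paged attention."
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # illustrative model choice

# Strategy A: the same prompt repeated N times, one completion each.
t0 = time.perf_counter()
llm.generate([prompt] * N, SamplingParams(n=1, max_tokens=2048))
t_duplicated = time.perf_counter() - t0

# Strategy B: one prompt, N completions sharing a single prefill.
t0 = time.perf_counter()
llm.generate([prompt], SamplingParams(n=N, max_tokens=2048))
t_batched = time.perf_counter() - t0

print(f"duplicated prompts: {t_duplicated:.1f}s, n={N} sampling: {t_batched:.1f}s")
```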

@qgallouedec
Member

Nice!! I am surprised; I expected a smaller speedup, given that the prefix should already be reused since #2757.

We should probably do the same with transformers generation in a future PR, if it makes sense.

Anyway, can you just add a comment somewhere to explain why we do this?

@winglian
Contributor

My guess is that it's easier for vLLM to optimize a single prompt with multiple generations than the same prompt sent multiple times, as in the #2776 refactor.

@edbeeching
Collaborator Author

@qgallouedec, without diving into the vLLM codebase, I would assume that the prefix cache is only used to compare a new batch of prompts against previously processed prompts. The system prompt is shared across all prompts, so it is cached and reused for every batch, whereas a new batch of prompts would first need to have its prefill computed and entered into the cache before vLLM could identify that num_generations of the prompts are exactly the same. Hence you get some improvement when you move from num_generations copies of each prompt to n generations per prompt.

Let me know if you would like me to clarify.

@qgallouedec
Member

qgallouedec commented Feb 14, 2025

Thanks Ed! Actually I meant adding a comment in the code to concisely explain why we merge the prompts. Something like: "Since 'prompts' contains 'num_generations' duplicates, we first take the unique prompts and generate num_generations outputs for each one. This is faster than generating outputs for each duplicate prompt individually."
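For reference, one order-preserving way to do that deduplication (a sketch with made-up variable names, not the trainer's actual code), since a plain set would not keep the prompt order:

```python
# 'prompts' holds each prompt repeated num_generations times, in order.
prompts = ["p0", "p0", "p1", "p1"]
num_generations = 2

# set() would lose the ordering, so use dict.fromkeys to deduplicate while keeping order.
ordered_unique_prompts = list(dict.fromkeys(prompts))
assert ordered_unique_prompts == ["p0", "p1"]

# After llm.generate(ordered_unique_prompts, SamplingParams(n=num_generations, ...)),
# flattening each request's outputs in order restores the original
# [num_generations] * [num_prompts] layout.
```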


@qgallouedec left a comment


Nice! Feel free to merge

@edbeeching merged commit 963243a into main on Feb 18, 2025
14 checks passed
@edbeeching deleted the vllm-batching branch on February 18, 2025 at 10:44
yxliu-TAMU pushed a commit to mincheolseong/ECEN743-GRPO-Project-Proposal that referenced this pull request Apr 20, 2025
* small optimization of vllm batching

* style

* adds comment

* style