Optimize vllm num_generations #2855
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
It should always be the case indeed.
Unfortunately,
Nice!! I am surprised, I expected a smaller speedup given that the prefix should already be reused since #2757. We should probably do the same with transformers generation in a future PR, if it makes sense. Anyway, can you just add a comment somewhere to explain why we do this?
My guess is that it's easier for vllm to optimize when it knows a single prompt has multiple generations than when it receives the same prompt repeated multiple times, as happens after the #2776 refactor.
@qgallouedec, without diving into the codebase of vllm, I would assume that the prefix cache is only used to compare a new batch of prompts with previously processed prompts. The system prompt is shared across all prompts, so this is cached and reused for all batches, whereas a new batch of prompts would first all need to have their prefill calculated and entered into the cache before vllm could identify that there are duplicated prompts within the batch. Let me know if you would like me to clarify.
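To make the distinction above concrete, here is a minimal sketch (the model name and prompts are placeholders, not taken from this PR) contrasting sending each prompt `num_generations` times against sending each unique prompt once with `n=num_generations`:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model, for illustration only
prompts = ["What is 2 + 2?", "Name a prime number."]
num_generations = 4

# Before: every prompt is repeated num_generations times, so vLLM receives
# len(prompts) * num_generations requests and relies on the prefix cache to
# avoid recomputing the shared prefills.
repeated = [p for p in prompts for _ in range(num_generations)]
outputs_repeated = llm.generate(repeated, SamplingParams(n=1, max_tokens=32))

# After (this PR's approach): each unique prompt is sent once and vLLM samples
# num_generations completions from a single prefill per prompt.
outputs_merged = llm.generate(prompts, SamplingParams(n=num_generations, max_tokens=32))
```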
Thanks Ed! Actually I meant adding a comment in the code to concisely explain why we merge the prompts. Something like
Nice! Feel free to merge
* small optimization of vllm batching
* style
* adds comment
* style
What does this PR do?
Adds an optimization of vllm batching by using `n=num_generations` in the vllm `SamplingParams`.

@qgallouedec I was looking at some examples in the debugger and the ordering of the gathered prompts always seems to be `[num_generations] * [num_prompts]`, but I am concerned there could be edge cases where this is not the case. What do you think?
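For reference, a minimal sketch (placeholder model and prompts, not the actual trainer code) of how completions come back grouped per prompt when `n=num_generations` is set, so that a simple flatten reproduces the per-prompt grouping of the gathered prompts described above:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
unique_prompts = ["Prompt A", "Prompt B"]       # prompts after de-duplication
num_generations = 4

sampling_params = SamplingParams(n=num_generations, temperature=1.0, max_tokens=64)
request_outputs = llm.generate(unique_prompts, sampling_params)

# Each RequestOutput carries num_generations completions for one prompt, so
# flattening keeps all generations of a given prompt contiguous, matching the
# ordering of the gathered (repeated) prompts.
completions = [c.text for out in request_outputs for c in out.outputs]
assert len(completions) == num_generations * len(unique_prompts)
```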