Optimize vllm num_generations #2855
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
It should always be the case indeed.
Unfortunately,
Nice!! I am surprised, I expected a smaller speedup given that the prefix should already be reused since #2757. We should probably do the same with transformers generation in a future PR, if it makes sense. Anyway, can you just add a comment somewhere to explain why we do this?
My guess is that it's easier for vllm to optimize when it knows a single prompt has multiple generations than when it receives the same prompt repeated multiple times, as happens after the #2776 refactor.
@qgallouedec, without diving into the codebase of vllm, I would assume that the prefix cache is only used to compare a new batch of prompts with previously processed prompts. The system prompt is shared across all prompts, so this is cached and reused for all batches, whereas a new batch of prompts would first all need to have their prefill calculated and entered into the cache before vllm could identify that there are duplicated prompts within the batch. Let me know if you would like me to clarify.
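To make the distinction above concrete, here is a minimal sketch (the model name and prompts are placeholders, not taken from this PR) contrasting sending each prompt `num_generations` times against sending each unique prompt once with `n=num_generations`:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model, for illustration only
prompts = ["What is 2 + 2?", "Name a prime number."]
num_generations = 4

# Before: every prompt is repeated num_generations times, so vLLM receives
# len(prompts) * num_generations requests and relies on the prefix cache to
# avoid recomputing the shared prefills.
repeated = [p for p in prompts for _ in range(num_generations)]
outputs_repeated = llm.generate(repeated, SamplingParams(n=1, max_tokens=32))

# After (this PR's approach): each unique prompt is sent once and vLLM samples
# num_generations completions from a single prefill per prompt.
outputs_merged = llm.generate(prompts, SamplingParams(n=num_generations, max_tokens=32))
```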
Thanks Ed! Actually I meant adding a comment in the code to concisely explain why we merge the prompts. Something like
Nice! Feel free to merge
* small optimization of vllm batching
* style
* adds comment
* style
What does this PR do?
Adds an optimization of vllm batching by using `n=num_generations` in the vllm `SamplingParams`.

@qgallouedec I was looking at some examples in the debugger and the ordering of the gathered prompts always seems to be `[num_generations] * [num_prompts]`, but I am concerned there could be edge cases where this is not the case. What do you think?
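For reference, a minimal sketch (placeholder model and prompts, not the actual trainer code) of how completions come back grouped per prompt when `n=num_generations` is set, so that a simple flatten reproduces the per-prompt grouping of the gathered prompts described above:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
unique_prompts = ["Prompt A", "Prompt B"]       # prompts after de-duplication
num_generations = 4

sampling_params = SamplingParams(n=num_generations, temperature=1.0, max_tokens=64)
request_outputs = llm.generate(unique_prompts, sampling_params)

# Each RequestOutput carries num_generations completions for one prompt, so
# flattening keeps all generations of a given prompt contiguous, matching the
# ordering of the gathered (repeated) prompts.
completions = [c.text for out in request_outputs for c in out.outputs]
assert len(completions) == num_generations * len(unique_prompts)
```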