[GRPO] Faster generation at the 7B scale

### Feature request

I have been running experiments in the open-r1 project with 7B Instruct and Reasoning models on code, although the same observations hold on mathematics datasets as well.

For reasoning models, generation is still a bottleneck. In the plot below, green is the generation time for `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` and blue is the generation time for `Qwen/Qwen2.5-7B-Instruct`:

![Image](https://github.com/user-attachments/assets/8ae0e6c4-a5ca-464b-baed-5fa2f9491f90)

We (@lewtun and @edbeeching) would like to know what can be done to improve the generation time, which is over 5 minutes in some cases.
There are a few things to explore:

1. Benchmarking the `trl vllm-serve` generation time on the 7B instruct and reasoning models detailed above. Expand the benchmark to include a mix of TP / PP options. Look at how the average generation time varies in a larger batch setting (see point 3). @shirinyamani I think this would be a great task for you; this dataset is a good source of prompts for the models: https://huggingface.co/datasets/open-r1/OpenR1-Math-cn_k12-86k
2. I believe that in the 7B setting it is possible to host the model on a single device (H100). vLLM does not support DDP natively, but perhaps we could implement something. One idea: in the 2-node setting, there are 8 accelerate processes on the node running the optimization loop. We could spawn 8 independent vLLM instances on the second node and have each accelerate process send prompts to its own dedicated vLLM instance, which would be specified by a unique port per process.
3. The final point is how we send batches to the vLLM instance. If I have understood correctly, the prompts are not grouped based on the number of gradient accumulation steps. Could a "mega-batch" of prompts be sampled so that vLLM can further benefit from its scheduler and continuous batching? It could be that this is already the case; I am sure @qgallouedec can answer this.
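
To make the idea in point 2 concrete, here is a minimal sketch of the rank-to-port mapping. It only builds the endpoint URL and does not start or call any server. The function name `vllm_endpoint_for_rank`, the host name `node2`, and the base port `8000` are assumptions for illustration, not existing TRL or vLLM settings.

```python
def vllm_endpoint_for_rank(
    local_rank: int,
    host: str = "node2",      # hypothetical hostname of the generation node
    base_port: int = 8000,    # hypothetical port of the first instance
    num_instances: int = 8,   # one instance per accelerate process
) -> str:
    """Return the URL of the instance dedicated to one training process."""
    if not 0 <= local_rank < num_instances:
        raise ValueError(
            f"local_rank {local_rank} has no dedicated instance "
            f"(expected 0..{num_instances - 1})"
        )
    # Instance i is assumed to listen on base_port + i.
    return f"http://{host}:{base_port + local_rank}"


# Each of the 8 processes resolves its own, distinct endpoint.
endpoints = [vllm_endpoint_for_rank(rank) for rank in range(8)]
```

Each process would then send its prompts only to its own endpoint, so the 8 instances behave like data-parallel replicas of the model.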

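The "mega-batch" idea in point 3 could look roughly like the sketch below: prompts from all gradient accumulation steps are flattened into one generation request, and the completions are split back into per-step micro-batches. This is an illustration only and says nothing about how the trainer currently batches prompts. `generate_mega_batch` and `generate_fn` are hypothetical names; `generate_fn` stands in for whatever call sends prompts to the vLLM instance and is assumed to return one completion per prompt, in order.

```python
from typing import Callable, List, Sequence


def generate_mega_batch(
    micro_batches: Sequence[Sequence[str]],
    generate_fn: Callable[[List[str]], List[str]],
) -> List[List[str]]:
    """Generate for all accumulation steps in one call, then regroup."""
    # Flatten the prompts of every accumulation step into a single request.
    flat_prompts = [prompt for batch in micro_batches for prompt in batch]
    completions = generate_fn(flat_prompts)
    if len(completions) != len(flat_prompts):
        raise ValueError("generate_fn must return one completion per prompt")

    # Slice the completions back into the original micro-batch layout.
    regrouped, start = [], 0
    for batch in micro_batches:
        regrouped.append(completions[start:start + len(batch)])
        start += len(batch)
    return regrouped


# Usage with a stand-in generator that records how often it is called.
calls = []

def fake_generate(prompts: List[str]) -> List[str]:
    calls.append(len(prompts))
    return [p.upper() for p in prompts]

result = generate_mega_batch([["a", "b"], ["c"], ["d", "e"]], fake_generate)
```

With a single large request, the server's scheduler sees all prompts at once instead of one micro-batch at a time.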


### Motivation

-

### Your contribution

-
