Conversation

@yuki-97 yuki-97 commented Jul 28, 2025

What does this PR do?

Upgrade vllm from v0.9.0 to v0.10.0.

This upgrade helps with:

  1. NCCL issue: NCCL error when using non-colocated generation on single node #722
     - Thanks @Dazz993 for helping root-cause this.
  2. gemma3 model support: fix: Fix gemma models broken by HF update #676
  3. VLM support.

Test Result

Llama3 (FSDP2 and mcore backend)

Note that the FSDP2 and mcore configs differ slightly.

FSDP2: [convergence and time plots]
mcore: [convergence and time plots]

gemma3 1b and 27b (FSDP2 backend)

1b: [convergence and time plots]
27b: [convergence and time plots]

DSV3 (mcore backend)

[convergence and time plots]

VLM (FSDP2 backend)

llava-hf/llava-1.5-7b-hf and HuggingFaceTB/SmolVLM2-2.2B-Instruct work fine.

[convergence and time plots]

The rewards for Qwen/Qwen2.5-VL-3B-Instruct and google/gemma-3-4b-it are fine, but both show an unexpected token_mult_prob_error.
This doesn't block upgrading vllm to v0.10.0, since v0.9.2 shows similar logprob behavior; tracked in #793.
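
For reference, a minimal sketch of how a train/inference logprob mismatch metric of this kind can be computed (the exact token_mult_prob_error definition in the codebase may differ; this is an illustration, not the actual implementation):

```python
import torch

def token_mult_prob_error(train_logprobs: torch.Tensor,
                          infer_logprobs: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: worst-case multiplicative probability error
    between the training framework's logprobs and vllm's logprobs.
    A value of 1.0 means the two agree exactly on every generated token."""
    # exp(|logp_train - logp_infer|) is the per-token probability ratio;
    # masked-out (non-generated) positions contribute exp(0) = 1.
    ratio = torch.exp((train_logprobs - infer_logprobs).abs() * mask)
    return ratio.max()
```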

[convergence and time plots]

non-colocated on single-node (FSDP2 backend)

Tested Qwen2.5-1.5B using 4 GPUs for training and 4 GPUs for inference.

[convergence and time plots]

non-colocated w/ async vllm on multi-nodes (FSDP2 backend)

Tested Llama-3.1-8B-Instruct using 2 nodes for training and 1 node for inference.

[convergence and time plots]

Issues

Closes #722.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 28, 2025
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jul 28, 2025
@yuki-97 yuki-97 requested review from yfw and terrykong July 28, 2025 12:47
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 28, 2025

@terrykong terrykong left a comment


Since v0.10.0 is out, are we able to jump straight to that one?

https://github.com/vllm-project/vllm/releases/tag/v0.10.0

SahilJain314 commented Jul 28, 2025

The release notes from 0.10.0 look good. Particularly this:

RLHF Support: new RPC methods for runtime weight reloading (https://github.com/vllm-project/vllm/pull/20096) and config updates (https://github.com/vllm-project/vllm/pull/20095), logprobs mode for selecting which stage of logprobs to return (https://github.com/vllm-project/vllm/pull/21398).

We need to be careful to select the post-processed logprobs for our case. The RPC methods should be useful as well.
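
As a rough sketch of how the weight-reloading RPC could be driven from the trainer side (collective_rpc has existed on the LLM entrypoint for a while; the reload_weights method name below is an assumption, see vllm-project/vllm#20096 for the actual API):

```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Broadcast an RPC to every worker. Assumption: the PR above exposes a
# worker-side weight-reload method reachable this way; the name here is
# illustrative, not confirmed against the merged API.
llm.collective_rpc("reload_weights")
```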

@yuki-97 yuki-97 marked this pull request as draft July 29, 2025 08:48

SahilJain314 commented Jul 30, 2025

Blocking the merge of #543 until vllm is updated to 0.10.0 here and this PR is merged.

@yuki-97 yuki-97 changed the title chore: upgrade vllm from v0.9.0 to v0.9.2 chore: upgrade vllm to v0.10.0 Jul 31, 2025
@yuki-97 yuki-97 force-pushed the yukih/vllm-0.9.2 branch from 0ac0567 to fe57f35 Compare July 31, 2025 13:13
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 31, 2025
@yuki-97 yuki-97 force-pushed the yukih/vllm-0.9.2 branch from fe57f35 to a16f198 Compare July 31, 2025 13:23
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 31, 2025
@yuki-97 yuki-97 marked this pull request as ready for review July 31, 2025 13:31

yuki-97 commented Jul 31, 2025

> We need to be careful to select the post-processed logprobs for our case.

With vllm==0.10.0, logprobs_mode defaults to "raw_logprobs", which matches our previous usage.
@SahilJain314 do you think I need to explicitly set logprobs_mode="raw_logprobs" when initializing the vllm worker?

@parthchadha

@yuki-666 let's force the init to always use raw_logprobs in case the default in vllm changes in future releases.
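
A minimal sketch of pinning it at init time (assuming logprobs_mode is accepted as an engine-level argument, per vllm-project/vllm#21398; the model name is just an example):

```python
from vllm import LLM

# Pin the logprobs stage explicitly so a future change to vllm's default
# ("raw_logprobs" as of v0.10.0) cannot silently alter training behavior.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B",
    logprobs_mode="raw_logprobs",
)
```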

@SahilJain314

Raw logprobs are fine for now (same as v1), but they are actually incorrect for top-k/top-p sampling (even though it appears as though it would work correctly). Both the logprobs from vllm and those from the training framework are incorrect since they haven't been processed, which introduces error. (I believe the claim from #773 that v1 is fine is incorrect.)

We have 2 options for how to mitigate this:

I'm leaning strongly towards the latter as it leaves less room for us to make mistakes and not notice. We should track this in #69.
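
For a concrete illustration of why raw logprobs diverge from the sampled distribution under truncated sampling (a self-contained sketch, not NeMo-RL code): top-k masks the tail and renormalizes, so a sampled token's processed logprob differs from its raw one.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
raw_logprobs = torch.log_softmax(logits, dim=-1)

# top-k processing (k=2): mask everything outside the top-k, renormalize
k = 2
topk = torch.topk(logits, k).indices
masked = torch.full_like(logits, float("-inf"))
masked[topk] = logits[topk]
processed_logprobs = torch.log_softmax(masked, dim=-1)

token = 0  # a token inside the top-k, so it can actually be sampled
print(raw_logprobs[token])        # ~ -0.50
print(processed_logprobs[token])  # ~ -0.31: mass re-concentrated after truncation
```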

@SahilJain314

@yuki-666 I see in some of the plots that the new vllm version is much more 'spiky' in logprob error. This is fine (spikes are occasional) but would you mind also adding plots with the logprob error clamped to a max value of 1.2? I'd like to confirm that the typical value has not increased at all.
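
Something like the following would produce the clamped view (a sketch; the .npy export path is hypothetical, any logger export of the per-step error series works):

```python
import numpy as np
import matplotlib.pyplot as plt

# Per-step token_mult_prob_error values pulled from the run logs.
err = np.load("token_mult_prob_error.npy")  # hypothetical export path

plt.plot(np.minimum(err, 1.2), label="token_mult_prob_error (clamped at 1.2)")
plt.axhline(1.0, linestyle="--", linewidth=0.8)  # 1.0 == exact train/infer agreement
plt.xlabel("step")
plt.legend()
plt.show()
```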

@terrykong

Also want to note that v0.10.0 may fix this issue I've observed in gemma runs:

vllm-project/vllm#19449

yuki-97 added 8 commits August 4, 2025 05:27
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 4, 2025
@terrykong terrykong enabled auto-merge August 4, 2025 19:33
@terrykong terrykong added this pull request to the merge queue Aug 4, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Aug 5, 2025
yuki-97 added a commit that referenced this pull request Aug 5, 2025
@terrykong terrykong added this pull request to the merge queue Aug 5, 2025
Merged via the queue into main with commit 9389dfb Aug 5, 2025
35 checks passed
@terrykong terrykong deleted the yukih/vllm-0.9.2 branch August 5, 2025 19:15
ashors1 added a commit that referenced this pull request Aug 7, 2025
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025