Conversation

@yuki-97 yuki-97 commented Jul 28, 2025

What does this PR do?

Upgrade vllm from v0.9.0 to v0.10.0.

This upgrade helps with:

  1. NCCL issue: NCCL error when using non-colocated generation on single node #722
     - Thanks @Dazz993 for helping root-cause this.
  2. gemma3 model support: fix: Fix gemma models broken by HF update #676
  3. VLM support.

Test Result

Llama3 (FSDP2 and mcore backend)

Note that the FSDP2 and mcore configs differ slightly.

FSDP2: [convergence and time plots]
mcore: [convergence and time plots]

gemma3 1b and 27b (FSDP2 backend)

1b: [convergence and time plots]
27b: [convergence and time plots]

DSV3 (mcore backend)

[convergence and time plots]

VLM (FSDP2 backend)

llava-hf/llava-1.5-7b-hf and HuggingFaceTB/SmolVLM2-2.2B-Instruct work fine.

[convergence and time plots]

The rewards for Qwen/Qwen2.5-VL-3B-Instruct and google/gemma-3-4b-it are fine, but both show an unexpected token_mult_prob_error.
This doesn't block upgrading vllm to v0.10.0, since v0.9.2 shows similar logprob behavior; tracked in #793.
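
For reference, a minimal sketch of how a train/inference logprob mismatch metric of this kind can be computed (the exact token_mult_prob_error definition in the codebase may differ; this is an illustration, not the actual implementation):

```python
import torch

def token_mult_prob_error(train_logprobs: torch.Tensor,
                          infer_logprobs: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: worst-case multiplicative probability error
    between the training framework's logprobs and vllm's logprobs.
    A value of 1.0 means the two agree exactly on every generated token."""
    # exp(|logp_train - logp_infer|) is the per-token probability ratio;
    # masked-out (non-generated) positions contribute exp(0) = 1.
    ratio = torch.exp((train_logprobs - infer_logprobs).abs() * mask)
    return ratio.max()
```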

[convergence and time plots]

non-colocated on single-node (FSDP2 backend)

Tested Qwen2.5-1.5B using 4 GPUs for training and 4 GPUs for inference.

[convergence and time plots]

non-colocated w/ async vllm on multi-nodes (FSDP2 backend)

Tested Llama-3.1-8B-Instruct using 2 nodes for training and 1 node for inference.

[convergence and time plots]

Issues

Closes #722.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 28, 2025
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jul 28, 2025
@yuki-97 yuki-97 requested review from yfw and terrykong July 28, 2025 12:47
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 28, 2025

@terrykong terrykong left a comment


Since v0.10.0 is out, are we able to jump straight to that one?

https://github.com/vllm-project/vllm/releases/tag/v0.10.0

SahilJain314 commented Jul 28, 2025

The release notes from 0.10.0 look good. Particularly this:

RLHF Support: new RPC methods for runtime weight reloading (https://github.com/vllm-project/vllm/pull/20096) and config updates (https://github.com/vllm-project/vllm/pull/20095), logprobs mode for selecting which stage of logprobs to return (https://github.com/vllm-project/vllm/pull/21398).

We need to be careful to select the post-processed logprobs for our case. The RPC methods should be useful as well.
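
As a rough sketch of how the weight-reloading RPC could be driven from the trainer side (collective_rpc has existed on the LLM entrypoint for a while; the reload_weights method name below is an assumption, see vllm-project/vllm#20096 for the actual API):

```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Broadcast an RPC to every worker. Assumption: the PR above exposes a
# worker-side weight-reload method reachable this way; the name here is
# illustrative, not confirmed against the merged API.
llm.collective_rpc("reload_weights")
```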

@yuki-97 yuki-97 marked this pull request as draft July 29, 2025 08:48

SahilJain314 commented Jul 30, 2025

Blocking the merge of #543 until vllm is updated to 0.10.0 here and this PR is merged.

@yuki-97 yuki-97 changed the title chore: upgrade vllm from v0.9.0 to v0.9.2 chore: upgrade vllm to v0.10.0 Jul 31, 2025
@yuki-97 yuki-97 force-pushed the yukih/vllm-0.9.2 branch from 0ac0567 to fe57f35 Compare July 31, 2025 13:13
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 31, 2025
@yuki-97 yuki-97 force-pushed the yukih/vllm-0.9.2 branch from fe57f35 to a16f198 Compare July 31, 2025 13:23
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 31, 2025
@yuki-97 yuki-97 marked this pull request as ready for review July 31, 2025 13:31

yuki-97 commented Jul 31, 2025

> We need to be careful to select the post-processed logprobs for our case.

With vllm==0.10.0, logprobs_mode defaults to "raw_logprobs", which matches our previous usage.
@SahilJain314 do you think I need to explicitly set logprobs_mode="raw_logprobs" when initializing the vllm worker?

@parthchadha

@yuki-666 let's force the init to always use raw_logprobs in case the default in vllm changes in future releases.
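
A minimal sketch of pinning it at init time (assuming logprobs_mode is accepted as an engine-level argument, per vllm-project/vllm#21398; the model name is just an example):

```python
from vllm import LLM

# Pin the logprobs stage explicitly so a future change to vllm's default
# ("raw_logprobs" as of v0.10.0) cannot silently alter training behavior.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B",
    logprobs_mode="raw_logprobs",
)
```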

@SahilJain314

Raw logprobs are fine for now (same as v1), but they are actually incorrect for top-k/top-p sampling (even though it appears as though it would work correctly). Both the logprobs from vllm and those from the training framework are incorrect since they haven't been processed, which introduces error. (I believe the claim from #773 that v1 is fine is incorrect.)

We have 2 options for how to mitigate this:

I'm leaning strongly towards the latter as it leaves less room for us to make mistakes and not notice. We should track this in #69.
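
For a concrete illustration of why raw logprobs diverge from the sampled distribution under truncated sampling (a self-contained sketch, not NeMo-RL code): top-k masks the tail and renormalizes, so a sampled token's processed logprob differs from its raw one.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
raw_logprobs = torch.log_softmax(logits, dim=-1)

# top-k processing (k=2): mask everything outside the top-k, renormalize
k = 2
topk = torch.topk(logits, k).indices
masked = torch.full_like(logits, float("-inf"))
masked[topk] = logits[topk]
processed_logprobs = torch.log_softmax(masked, dim=-1)

token = 0  # a token inside the top-k, so it can actually be sampled
print(raw_logprobs[token])        # ~ -0.50
print(processed_logprobs[token])  # ~ -0.31: mass re-concentrated after truncation
```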

@SahilJain314

@yuki-666 I see in some of the plots that the new vllm version is much more 'spiky' in logprob error. This is fine (spikes are occasional) but would you mind also adding plots with the logprob error clamped to a max value of 1.2? I'd like to confirm that the typical value has not increased at all.
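
Something like the following would produce the clamped view (a sketch; the .npy export path is hypothetical, any logger export of the per-step error series works):

```python
import numpy as np
import matplotlib.pyplot as plt

# Per-step token_mult_prob_error values pulled from the run logs.
err = np.load("token_mult_prob_error.npy")  # hypothetical export path

plt.plot(np.minimum(err, 1.2), label="token_mult_prob_error (clamped at 1.2)")
plt.axhline(1.0, linestyle="--", linewidth=0.8)  # 1.0 == exact train/infer agreement
plt.xlabel("step")
plt.legend()
plt.show()
```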

@terrykong

Also want to note that v0.10.0 may fix this issue I've observed in gemma runs:

vllm-project/vllm#19449

yuki-97 added 8 commits August 4, 2025 05:27
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 4, 2025
@terrykong terrykong enabled auto-merge August 4, 2025 19:33
@terrykong terrykong added this pull request to the merge queue Aug 4, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Aug 5, 2025
yuki-97 added a commit that referenced this pull request Aug 5, 2025
@terrykong terrykong added this pull request to the merge queue Aug 5, 2025
Merged via the queue into main with commit 9389dfb Aug 5, 2025
35 checks passed
@terrykong terrykong deleted the yukih/vllm-0.9.2 branch August 5, 2025 19:15
ashors1 added a commit that referenced this pull request Aug 7, 2025
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025