
Conversation

rohitrango
Contributor

@rohitrango rohitrango commented Jul 22, 2025

What does this PR do ?

Adds image / video VLM support for supervised finetuning and GRPO using dtensor policy. Solves #85
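
For context, a minimal sketch (not code from this PR) of what a single image + text SFT sample looks like once a Hugging Face processor has prepared it; the model name, image path, and message contents are illustrative assumptions only:

from PIL import Image
from transformers import AutoProcessor

# Model name and image path are placeholders for illustration.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many objects are in the scene?"},
        ],
    },
    {"role": "assistant", "content": [{"type": "text", "text": "There are 4 objects."}]},
]

# Render the chat template to text, then let the processor produce input_ids
# plus the model-specific vision tensors (e.g. pixel_values, image_grid_thw).
prompt = processor.apply_chat_template(messages, tokenize=False)
image = Image.open("example.png").convert("RGB")
batch = processor(text=[prompt], images=[image], return_tensors="pt")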

Tested models:

  • Qwen2VL / Qwen2.5VL
  • Llava 1.5 / Llava Next / Llava Next Video / Llava OneVision
  • Huggingface SmolVLM2-2.2B-Instruct
  • Gemma3 4B

Tested datasets:

  • Geometry3k
  • CLEVR
  • RefCOCO

🔪 Sharp Edges

Although training runs converge, the logprob error between the vLLM and HF models is consistently higher than 1.05. Issue tracked in #793.

Edit: this only affects Gemma3. The logprob issue is fixed for Llava, SmolVLM, and Qwen2/2.5-VL.
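
For reference, a rough sketch of the kind of metric behind the 1.05 figure; the exact definition used by the training code is an assumption here (a mean per-token multiplicative probability error between the generation-time vLLM logprobs and the training-time HF logprobs):

import torch

def token_mult_prob_error(vllm_logprobs: torch.Tensor,
                          hf_logprobs: torch.Tensor,
                          mask: torch.Tensor) -> float:
    # exp(mean |logp_vllm - logp_hf|) over generated tokens: 1.0 means the two
    # backends agree exactly; 1.05 means per-token probabilities differ by ~5%
    # on average. Definition assumed for illustration, not taken from the repo.
    diff = (vllm_logprobs - hf_logprobs).abs() * mask
    return torch.exp(diff.sum() / mask.sum()).item()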

Usage

uv run examples/run_sft.py --config examples/configs/sft_clevr.yaml cluster.gpus_per_node=4
uv run examples/run_vlm_grpo.py cluster.gpus_per_node=4

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

@rohitrango rohitrango changed the base branch from rohit/vlm_grpo to main July 29, 2025 01:49
@rohitrango rohitrango changed the title from "feat: SFT support for multimodal training (VLM)" to "feat: GRPO + SFT support for multimodal training" Jul 29, 2025
@rohitrango rohitrango marked this pull request as ready for review July 29, 2025 18:30

❌ Submodule Fast-Forward Check Failed

Check based on commit: cc2986f (PR #712 from rohit/sft_vlm)

❌ Submodules that need attention:

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/terrykong/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/
CURRENT (PR #712 from rohit/sft_vlm): https://github.com/terrykong/Megatron-LM/commits/ed5c792f2a8ffe357c871f4547a8fe905a09b835/

NeMo: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/NeMo/commits/33259f2540af6eef375d43fc48bdcbd7ec490c29/
CURRENT (PR #712 from rohit/sft_vlm): https://github.com/NVIDIA/NeMo/commits/0e0894300e09aca042bc07859f660f22858f0a9f/

Please ensure all submodule commits are fast-forwards of the main branch before merging.


❌ Submodule Fast-Forward Check Failed

Check based on commit: 919a7ce (PR #712 from rohit/sft_vlm)

❌ Submodules that need attention:

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/terrykong/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/
CURRENT (PR #712 from rohit/sft_vlm): https://github.com/terrykong/Megatron-LM/commits/ed5c792f2a8ffe357c871f4547a8fe905a09b835/

NeMo: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/NeMo/commits/33259f2540af6eef375d43fc48bdcbd7ec490c29/
CURRENT (PR #712 from rohit/sft_vlm): https://github.com/NVIDIA/NeMo/commits/0e0894300e09aca042bc07859f660f22858f0a9f/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@terrykong terrykong changed the title from "feat: GRPO + SFT support for multimodal training" to "feat: GRPO + SFT Dtensor support for multimodal training" Jul 29, 2025
@terrykong
Contributor

terrykong commented Jul 29, 2025

Copying over the last message from @rohitrango from #655:

re: Remaining blockers:

  1. Understanding the logprob error: I want to chalk this up to how vLLM loads multimodal image embeddings in its image processor. For LLM-only models, I noted that vLLM takes the same list of token_ids (a list of ints) that the policy consumes (i.e. it goes through the same text embedding layer, etc.). However, for multimodal inputs, vLLM processes the images internally. There could also be differences in how sampling is done. I found the following excerpt in the vLLM docs: https://docs.vllm.ai/en/v0.9.1/usage/v1_guide.html#feature-model

Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e. before applying any logits post-processing such as temperature scaling or penalty adjustments). As a result, the returned logprobs do not reflect the final adjusted probabilities used during sampling.
Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.

I prefer handling this issue in a separate PR (and merging initial support first) for four reasons:

  • this discrepancy is isolated to multimodal models only, so a "fix" can be shipped independently
  • multiple VLMs converge on three different datasets despite the apparent discrepancy. It is equivalent to training GRPO with a slightly off-policy model, but it does not seem to be very unstable or destructive to the learning process
  • other PRs break multimodal support regularly (every 2-3 days) and I have to roll back / fix those changes in my PR to make my scripts work. Merging this PR, or at least the test cases, will prevent other PRs from breaking multimodal support
  • the PR has gotten very big as it is, and adding more fixes will add additional overhead to the review process
  2. The PR has now migrated (again) to feat: GRPO + SFT Dtensor support for multimodal training #712, and is tested on 4 families of multimodal models and 3 datasets. This rolls back the passing around of the vlm_kwargs list throughout the training process and instead proposes a PackedGenericDataItem to handle non-sequence data items (most of them being multimodal tensors); a sketch of the idea follows below. The single implementation seems to work for multiple multimodal models without any additional modifications to the config.
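
A hypothetical sketch of the PackedGenericDataItem idea described above (the actual class and field names in the PR may differ): carry the non-sequence tensors alongside the token sequences and concatenate them along dim 0 when micro-batches are packed, instead of threading model-specific vlm_kwargs through the whole training loop.

from dataclasses import dataclass
import torch

@dataclass
class PackedGenericDataItem:
    # Non-sequence tensors for one sample, e.g.
    # {"pixel_values": (num_patches, hidden), "image_grid_thw": (n_images, 3)}.
    tensors: dict[str, torch.Tensor]

    @staticmethod
    def collate(items: list["PackedGenericDataItem"]) -> dict[str, torch.Tensor]:
        # Pack a micro-batch by concatenating each named tensor along dim 0,
        # mirroring how packed token sequences are concatenated.
        keys = items[0].tensors.keys()
        return {k: torch.cat([it.tensors[k] for it in items], dim=0) for k in keys}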


❌ Submodule Fast-Forward Check Failed

Check based on commit: 80d9ff5 (PR #712 from rohit/sft_vlm)

❌ Submodules that need attention:

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/terrykong/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/
CURRENT (PR #712 from rohit/sft_vlm): https://github.com/terrykong/Megatron-LM/commits/ed5c792f2a8ffe357c871f4547a8fe905a09b835/

NeMo: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/NeMo/commits/33259f2540af6eef375d43fc48bdcbd7ec490c29/
CURRENT (PR #712 from rohit/sft_vlm): https://github.com/NVIDIA/NeMo/commits/0e0894300e09aca042bc07859f660f22858f0a9f/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@terrykong
Contributor

@rohitrango My understanding is that the logprob issue may currently come from input processing not matching inside vLLM versus outside of it. The excerpt you shared relates to sampling, so I think it remains to be seen whether this is a bug or expected behavior.

@terrykong
Contributor

As far as keeping up with changes from main: if you rebase and encounter conflicts, it's advised to squash your commits first, since you'll then hit each conflict only once instead of once per commit in the branch that touched that area.

@rohitrango
Contributor Author

re: keeping up with changes from main, the merge commits are not as big of an issue. The bigger issue is changes that break the multimodal training loop, like adding extra parameters to the dtensor path that are only supported for LLMs, or introducing a model.lm_head somewhere (for VLMs the module would be model.language_model.lm_head), etc.

This basically means I have to debug a working GRPO/SFT training loop every 2 days after merging from main.

The multimodal test cases are expected to block all such changes.
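
To illustrate the lm_head failure mode mentioned above, a defensive lookup that works for both layouts; the function name and fallback order are assumptions for illustration, not code from this PR:

import torch.nn as nn

def get_lm_head(model: nn.Module) -> nn.Module:
    # Plain causal LMs expose the head at the top level...
    if hasattr(model, "lm_head"):
        return model.lm_head
    # ...while many HF VLMs nest it under the language sub-model.
    if hasattr(model, "language_model") and hasattr(model.language_model, "lm_head"):
        return model.language_model.lm_head
    raise AttributeError("could not locate an lm_head on this model")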

@yfw yfw added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 19, 2025
@terrykong terrykong added this pull request to the merge queue Aug 19, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 19, 2025
@terrykong terrykong added this pull request to the merge queue Aug 19, 2025
Merged via the queue into main with commit eb50202 Aug 19, 2025
39 of 40 checks passed
@terrykong terrykong deleted the rohit/sft_vlm branch August 19, 2025 23:51
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Sep 4, 2025