@yuki-97 yuki-97 commented May 12, 2025

What does this PR do?

  1. Support FSDP2 on non-MoE models.
  2. Support Hugging Face TP plan.
  3. Parallel-plan selection priority is custom-parallel-plan > opt-parallel-plan (the optimized plans we implemented for certain models in FSDP2) > hf-tp-plan (HF's _tp_plan).
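The selection order above amounts to a first-match fallback. A minimal sketch of that priority, assuming illustrative names (`resolve_parallel_plan` and its arguments are not this repo's actual API):

```python
# Hypothetical sketch of the plan-selection priority described above:
# custom-parallel-plan > opt-parallel-plan > hf-tp-plan (HF's _tp_plan).
# The function and argument names are illustrative, not the PR's actual code.

def resolve_parallel_plan(custom_plan=None, opt_plan=None, hf_tp_plan=None):
    """Return the first available plan in priority order, or None for plain FSDP."""
    for plan in (custom_plan, opt_plan, hf_tp_plan):
        if plan:  # first non-empty plan wins
            return plan
    return None  # no TP plan: fall back to plain FSDP sharding
```

For example, a model with no custom plan but an optimized plan would use the optimized plan even when HF also ships a `_tp_plan`.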

Convergence tests on LlamaForCausalLM, Qwen2ForCausalLM, Qwen3ForCausalLM, Gemma2ForCausalLM, Gemma3ForCausalLM, and Phi3ForCausalLM all run well.

Convergence Test Detail

  • Llama-3.1-8B-Instruct (LlamaForCausalLM): FSDP2-tp8-opt_plan vs FSDP2-tp8-hf_tp_plan
  • Qwen2.5-7B-Instruct (Qwen2ForCausalLM): FSDP2-tp4-opt_plan vs FSDP2-tp4-hf_tp_plan
  • Qwen3-0.6B (Qwen3ForCausalLM): FSDP1 vs FSDP2-tp1
  • gemma-2-9b-it (Gemma2ForCausalLM): FSDP1 vs FSDP2-tp1 vs FSDP2-tp4-hf_tp_plan
  • gemma-3-1b-it (Gemma3ForCausalLM): FSDP1 vs FSDP2-tp1

Issues

Closes #156

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
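A possible usage example, adapted from the launch command discussed later in this thread; the model name shown here is an assumption, and any TP/dtensor config keys are intentionally omitted since the PR description does not confirm their names:

```shell
# Illustrative only: launch a GRPO math run with the policy trained under FSDP2.
uv run examples/run_grpo_math.py \
  policy.model_name=meta-llama/Llama-3.1-8B-Instruct \
  cluster.gpus_per_node=8
```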

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@yuki-97 yuki-97 force-pushed the yukih/fsdp2-general branch 14 times, most recently from d5ad00d to 8e2e6f4 Compare May 20, 2025 09:59
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 20, 2025
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label May 20, 2025
@yuki-97 yuki-97 added the CI:docs Run doctest label May 20, 2025

yuki-97 commented May 20, 2025

Filed another issue, #413, to track FSDP2 for MoE models.

  1. Qwen3-30B-A3B is noticeably slower than Qwen3-32B, especially during the refit process or when using hf-tp-plan with dtensor tp > 1.
  2. DeepseekV2ForCausalLM with FSDP2 fails in self.use_reference_model() with a shape mismatch on model.layers.0.self_attn.rotary_emb.cos_cached: v.shape=torch.Size([2048, 64]) vs self.reference_model_buffers[k].shape=torch.Size([163840, 64]).
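The failure in item 2 amounts to a buffer-shape consistency check between the live model and the reference model. A minimal sketch of such a check, working on plain shape tuples rather than torch tensors (the function and argument names are assumptions, not the actual use_reference_model() code):

```python
# Sketch of the kind of check that fails in item 2: every live-model buffer
# must match the reference model's buffer shape. Names are illustrative only;
# the real code compares torch tensor shapes inside use_reference_model().

def check_reference_buffers(model_shapes, reference_shapes):
    """Raise if any buffer shape differs between the live and reference model."""
    for name, shape in model_shapes.items():
        ref_shape = reference_shapes[name]
        if shape != ref_shape:
            raise ValueError(
                f"{name}: v.shape={shape} != reference shape {ref_shape}"
            )
```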

@yuki-97 yuki-97 changed the title from "feat: general fsdp2" to "feat: general fsdp2 on non-MoE models + HF TP plan" May 20, 2025
@yuki-97 yuki-97 marked this pull request as ready for review May 20, 2025 13:59
@yuki-97 yuki-97 force-pushed the yukih/fsdp2-general branch from fc6cc49 to 05d8cfe Compare May 21, 2025 02:53
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 21, 2025
@yuki-97 yuki-97 force-pushed the yukih/fsdp2-general branch 3 times, most recently from 08cce8c to 0dc55cc Compare May 22, 2025 06:59
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 22, 2025

yuki-97 commented May 22, 2025

> @yuki-666 what were the parameters of your gemma2 run? I can't seem to get it to run correctly:
>
> uv run examples/run_grpo_math.py policy.model_name=google/gemma-2-2b-it logger.wandb_enabled=True cluster.gpus_per_node=8 +policy.generation.vllm_cfg.load_format=auto

@terrykong Thanks very much for pointing this out!

I tested with almost the same script as yours before commit fdb565c. After that commit, vLLM's load_format during training defaults to dummy, and only specific models change it through nemo_rl/models/huggingface/common.py. The param policy.generation.vllm_cfg.load_format was removed from the yaml and has no effect even if we pass it.

It is fixed now, and other models won't be affected since they don't need special handling of load_format.
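A rough sketch of the behavior described above, assuming a lookup keyed by model class (the mapping contents and function name are illustrative; the real logic lives in nemo_rl/models/huggingface/common.py):

```python
# Hedged sketch: training-time vLLM load_format defaults to "dummy", and only
# models that need special handling override it. The mapping entry and the
# function name below are assumptions for illustration, not the repo's code.

SPECIAL_LOAD_FORMATS = {
    "Gemma2ForCausalLM": "auto",  # assumed example entry
}

def resolve_load_format(model_arch: str) -> str:
    """Default to dummy-initialized weights; refit supplies real weights later."""
    return SPECIAL_LOAD_FORMATS.get(model_arch, "dummy")
```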


yuki-97 commented May 22, 2025

Thanks @jgerh, I have updated based on your suggestions.

yuki-97 added 9 commits May 23, 2025 17:46
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/fsdp2-general branch from 0dc55cc to 72a8f35 Compare May 23, 2025 09:46
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 23, 2025
@terrykong

Thanks for the quick fix @yuki-666 . Gemma2 seems to be okay now from a quick run:


@parthchadha parthchadha added this pull request to the merge queue May 23, 2025
Merged via the queue into main with commit 3db05c1 May 23, 2025
21 of 23 checks passed
@parthchadha parthchadha deleted the yukih/fsdp2-general branch May 23, 2025 22:10
YzjiaoNvd pushed a commit to YzjiaoNvd/NeMo-RL that referenced this pull request Jun 10, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Labels
CI:docs (Run doctest), CI:L1 (Run doctests, unit tests, and functional tests), documentation (Improvements or additions to documentation)

Successfully merging this pull request may close issue #156 (support FSDP2 generally).

6 participants