feat: add nemotron5 sharding #481

gshennvm · 2025-06-04T19:58:42Z

Adds nm5 (nemotron5) sharding. It supports 32K context in my small-script testing.

No need to shard the Mamba layers; activation checkpointing is enough. This shards only the MLPs.
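To make the "MLPs only" idea concrete, here is a minimal sketch of how such a plan could be assembled for a hybrid stack. It is not the code from this PR: the block layout, the `layers.{idx}.mlp.*` module names, and both helper functions are hypothetical, and it assumes activation checkpointing is applied to the Mamba blocks. The `"colwise"` / `"rowwise"` strings are placeholders for whatever parallel styles the framework uses (for example `ColwiseParallel` and `RowwiseParallel` from `torch.distributed.tensor.parallel`).

```python
from typing import Dict, List

# Hypothetical block types for a hybrid Mamba/MLP stack; the real model's
# module names and layout may differ.
MAMBA = "mamba"
MLP = "mlp"


def build_mlp_only_tp_plan(block_types: List[str]) -> Dict[str, str]:
    """Return a tensor-parallel plan that covers MLP projections only."""
    plan: Dict[str, str] = {}
    for idx, kind in enumerate(block_types):
        if kind != MLP:
            continue  # Mamba blocks are left unsharded.
        # Column-wise split on the up projection, row-wise on the down
        # projection: the usual pairing for a two-layer MLP.
        plan[f"layers.{idx}.mlp.up_proj"] = "colwise"
        plan[f"layers.{idx}.mlp.down_proj"] = "rowwise"
    return plan


def blocks_to_checkpoint(block_types: List[str]) -> List[int]:
    """Indices of blocks whose activations would be recomputed (assumed: Mamba)."""
    return [idx for idx, kind in enumerate(block_types) if kind == MAMBA]


block_types = [MAMBA, MLP, MAMBA, MLP]
tp_plan = build_mlp_only_tp_plan(block_types)
ckpt_indices = blocks_to_checkpoint(block_types)
print(tp_plan)
print(ckpt_indices)
```

In a real setup the plan entries would be handed to the framework's parallelization call together with a device mesh, and the checkpointed blocks would be wrapped with its activation-checkpointing utility; both steps are omitted here.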

nemo_rl/models/dtensor/parallelize.py

terrykong · 2025-06-07T06:47:28Z

@gshennvm do you mind sharing command + wandb plots for posterity?

terrykong · 2025-08-08T17:35:35Z

Was putting out other fires; will come back to this PR soon. For context, the issue now is that we create a dummy Mamba model to run unit tests, but the Mamba model class unfortunately needs the mamba package to be installed before it can even be imported.

needs some thought
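One common way to handle an optional heavy dependency in tests is to skip them when the package is absent. The sketch below illustrates that pattern only; it is not necessarily what this PR ended up doing. The helper `has_module` and the test class name are hypothetical. `mamba_ssm` is the import name of the mamba-ssm package.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Evaluated once at collection time; no mamba code is imported here.
HAS_MAMBA = has_module("mamba_ssm")


@unittest.skipUnless(HAS_MAMBA, "mamba-ssm is not installed")
class DummyMambaModelTest(unittest.TestCase):
    def test_model_class_imports(self) -> None:
        # Import lazily so collecting this file never fails without mamba-ssm.
        import mamba_ssm  # noqa: F401
```

The trade-off is that environments without the package silently lose coverage, so CI would still need at least one job where the dependency is installed.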

- Add mamba-ssm and causal-conv1d dependencies to automodel and vllm extras
- Configure git sources for mamba-ssm and causal-conv1d packages
- Add no-build-isolation for mamba-ssm and causal-conv1d
- Implement _parallelize_nm5_h function for NemotronHForCausalLM parallelization
- Update related unit tests for new parallelization functionality

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix stuff

Signed-off-by: Terry Kong <terryk@nvidia.com>

gerald's fix for 32k

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix the tests

Signed-off-by: Terry Kong <terryk@nvidia.com>

Signed-off-by: Terry Kong <terryk@nvidia.com>

Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>

commit b246e55 Author: Youngeun Kwon <youngeunk@nvidia.com> Date: Mon Aug 25 15:05:48 2025 -0700 update the script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 5315a6b Author: Youngeun Kwon <youngeunk@nvidia.com> Date: Mon Aug 25 13:59:16 2025 -0700 script update Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 4437402 Author: Youngeun Kwon <youngeunk@nvidia.com> Date: Tue Jul 15 17:42:23 2025 -0700 local Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> wip Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> add script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> update script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> update script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> interactive Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit b721703 Author: Charlie Truong <chtruong@nvidia.com> Date: Mon Aug 18 11:22:54 2025 -0500 build: Fix pytorch image ref in Dockerfile.ngc_pytorch (NVIDIA-NeMo#936) Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit 70b9666 Author: Charlie Truong <chtruong@nvidia.com> Date: Sun Aug 17 21:17:58 2025 -0500 build: Add Dockerfile that uses NGC pytorch image (NVIDIA-NeMo#897) Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit df31c1b Author: pjin-nvidia <pjin@nvidia.com> Date: Thu Aug 14 18:34:50 2025 -0700 feat: chunked logprob calculation with deferred fp32 cast to help with OOM (NVIDIA-NeMo#918) Signed-off-by: Peter Jin <pjin@nvidia.com>

commit 83c6bfc Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Thu Aug 14 21:48:55 2025 +0800 refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900) Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 9f7825e Author: Rayen <130129397+RayenTian@users.noreply.github.com> Date: Thu Aug 14 12:38:27 2025 +0800 feat: Add TP to embed_tokens and lm_head for Gemma models (NVIDIA-NeMo#879) Signed-off-by: ruit <ruit@nvidia.com>

commit e1f56c4 Author: Terry Kong <terrycurtiskong@gmail.com> Date: Tue Aug 12 13:09:37 2025 -0700 feat: add diagnostic script for problematic embeddings (NVIDIA-NeMo#896) Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 223bfa8 Author: Gerald Shen <119401249+gshennvm@users.noreply.github.com> Date: Mon Aug 11 18:19:52 2025 -0700 feat: add nemotron5 sharding (NVIDIA-NeMo#481) Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com>

commit 18b9e2c Author: Terry Kong <terrycurtiskong@gmail.com> Date: Mon Aug 11 15:08:52 2025 -0700 test: lower step count on gemma nightly test to finish within 4 hours (NVIDIA-NeMo#880) Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 8fd8c96 Author: guyueh1 <140554423+guyueh1@users.noreply.github.com> Date: Mon Aug 11 10:46:29 2025 -0700 feat: Fix and enhances for Nsight system profiling (NVIDIA-NeMo#865) Signed-off-by: Guyue Huang <guyueh@nvidia.com>

commit 2b87def Author: Qidong Su <soodoshll@gmail.com> Date: Fri Aug 8 18:54:20 2025 -0400 fix: OOM in deepscaler1.5b with sequence length = 16/24k (NVIDIA-NeMo#875) Signed-off-by: Qidong Su <qidongs@nvidia.com>

commit fecf71e Author: Rayen <130129397+RayenTian@users.noreply.github.com> Date: Sat Aug 9 06:42:07 2025 +0800 fix: remove tie weight check (NVIDIA-NeMo#700) Signed-off-by: ruit <ruit@nvidia.com>

commit d45ff3f Author: Terry Kong <terrycurtiskong@gmail.com> Date: Fri Aug 8 10:07:02 2025 -0700 test: add deepscaler tests + pipe-clean configs + fix eval for deepscaler (NVIDIA-NeMo#866) Signed-off-by: Terry Kong <terryk@nvidia.com>

commit d73c942 Author: Anna Shors <ashors@nvidia.com> Date: Fri Aug 8 09:27:15 2025 -0700 feat: qwen3 export to HF (NVIDIA-NeMo#873) Signed-off-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com> Signed-off-by: Anna Shors <ashors@nvidia.com> Co-authored-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>

commit e924d33 Author: Shang Wang <samshang.wang@mail.utoronto.ca> Date: Fri Aug 8 12:15:34 2025 -0400 docs: Link uv's installation instructions to uv's website (NVIDIA-NeMo#837) Signed-off-by: Shang Wang <samshang.wang@mail.utoronto.ca>

commit bbbb3d6 Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Fri Aug 8 23:26:15 2025 +0800 fix: fix non-colocated with cpu_offload enabled (NVIDIA-NeMo#861) Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 88a399e Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Fri Aug 8 14:04:08 2025 +0800 chore: remove old fsdp1 unit test (NVIDIA-NeMo#871) Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit b8a89a9 Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Fri Aug 8 13:56:19 2025 +0800 feat: support non-colocated in mcore (NVIDIA-NeMo#613) Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 5910abb Author: Anna Shors <ashors@nvidia.com> Date: Thu Aug 7 13:11:43 2025 -0700 feat: support DTensor CP in DPO and SFT (NVIDIA-NeMo#798) Signed-off-by: ashors1 <ashors@nvidia.com>

commit 0988a7d Author: Felipe Vieira Frujeri <ffrujeri@gmail.com> Date: Wed Aug 6 22:01:32 2025 -0700 fix: Fix error message in VllmGenerationWorker. (NVIDIA-NeMo#633) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

commit 233cc07 Author: Parth Chadha <pchadha@nvidia.com> Date: Wed Aug 6 15:14:22 2025 -0700 fix: force use of eager (disabled cuda graphs) due to convergence issues (NVIDIA-NeMo#857) Signed-off-by: Parth Chadha <pchadha@nvidia.com>

commit 0557402 Author: Terry Kong <terrycurtiskong@gmail.com> Date: Wed Aug 6 14:44:29 2025 -0700 chore: 0.3.0 -> 0.4.0rc0 (NVIDIA-NeMo#840) Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 03472a0 Author: Terry Kong <terrycurtiskong@gmail.com> Date: Wed Aug 6 14:43:55 2025 -0700 feat: dockerfile can build hermetically or from build context (NVIDIA-NeMo#799) Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 9af0a52 Author: Anna Shors <ashors@nvidia.com> Date: Wed Aug 6 12:35:51 2025 -0700 fix: fix grpo + mcore checkpointing without validation (NVIDIA-NeMo#844) Signed-off-by: ashors1 <ashors@nvidia.com>

commit b6269f7 Author: Yubo Gao <yubog@nvidia.com> Date: Tue Aug 5 16:55:02 2025 -0400 feat: track policy training compute throughput (NVIDIA-NeMo#632) Signed-off-by: Yubo Gao <yubog@nvidia.com>

commit b74c5d0 Author: Wei Du <wedu@nvidia.com> Date: Tue Aug 5 15:05:13 2025 -0500 feat: save checkpoint before timeout to avoid 4-hour runtime limit (NVIDIA-NeMo#734) Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>

commit c784dd9 Author: Zhiyu Li <zhiyul@NVIDIA.com> Date: Tue Aug 5 10:47:30 2025 -0700 feat: add data shuffle and random seed option (NVIDIA-NeMo#334) Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

commit c249efc Author: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com> Date: Tue Aug 5 21:33:28 2025 +0400 docs: fix checkpointing command for megatron->hf export (NVIDIA-NeMo#823) Signed-off-by: abdalgader-a <abdalgader.abubaker@tii.ae> Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

gshennvm changed the title ~~Geshen/nm5~~ add nemotron5 sharding Jun 4, 2025

gshennvm self-assigned this Jun 4, 2025

gshennvm added the Run CICD label Jun 4, 2025

gshennvm changed the title ~~add nemotron5 sharding~~ feat: add nemotron5 sharding Jun 4, 2025

gshennvm requested a review from terrykong June 4, 2025 20:36

gshennvm added the CI:L0 Run doctests and unit tests label Jun 4, 2025

gshennvm temporarily deployed to nemo-ci June 4, 2025 20:38 — with GitHub Actions Inactive

gshennvm added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jun 4, 2025

gshennvm had a problem deploying to nemo-ci June 4, 2025 22:45 — with GitHub Actions Failure

terrykong reviewed Jun 7, 2025

nemo_rl/models/dtensor/parallelize.py (resolved)

nemo_rl/models/dtensor/parallelize.py (resolved)

gshennvm force-pushed the geshen/nm5 branch from b0d4499 to 1fa9fef Compare June 30, 2025 18:29

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Jul 15, 2025

terrykong had a problem deploying to nemo-ci July 15, 2025 23:41 — with GitHub Actions Failure

terrykong temporarily deployed to nemo-ci July 16, 2025 21:22 — with GitHub Actions Inactive

terrykong added CI:L0 Run doctests and unit tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 26, 2025

terrykong force-pushed the geshen/nm5 branch 3 times, most recently from c809cdb to a95b9d6 Compare July 31, 2025 05:55

terrykong enabled auto-merge August 4, 2025 06:14

terrykong previously approved these changes Aug 4, 2025

terrykong added this pull request to the merge queue Aug 4, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 4, 2025

terrykong added research Tag for research team's issues and removed Run CICD labels Aug 7, 2025

terrykong dismissed their stale review via ee24ee3 August 11, 2025 07:14

terrykong force-pushed the geshen/nm5 branch from ee24ee3 to c6687d3 Compare August 11, 2025 07:16

terrykong force-pushed the geshen/nm5 branch from c6687d3 to 2d8cec0 Compare August 11, 2025 07:17

terrykong previously approved these changes Aug 11, 2025

terrykong enabled auto-merge August 11, 2025 07:18

terrykong added this pull request to the merge queue Aug 11, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 11, 2025

fix unit test by offering a prepare script

f3111c9

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong dismissed their stale review via f3111c9 August 12, 2025 01:13

github-actions bot added the documentation Improvements or additions to documentation label Aug 12, 2025

terrykong added 2 commits August 12, 2025 01:15

nit

3ddba01

Signed-off-by: Terry Kong <terryk@nvidia.com>

here we go

c0ecc2e

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong approved these changes Aug 12, 2025

terrykong enabled auto-merge August 12, 2025 01:17

terrykong added this pull request to the merge queue Aug 12, 2025

Merged via the queue into main with commit 223bfa8 Aug 12, 2025
19 checks passed

terrykong deleted the geshen/nm5 branch August 12, 2025 03:35

soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025

feat: add nemotron5 sharding (NVIDIA-NeMo#481)

e510dea

Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>

jveronvialard pushed a commit that referenced this pull request Aug 27, 2025

feat: add nemotron5 sharding (#481)

8b50759

Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
