feat: support non-colocated in mcore #613

yuki-97 · 2025-07-04T09:10:49Z

What does this PR do?

Support non-colocated generation (both sync and async) in the mcore worker.

Test Result

logprob_inference_prep is slower than in the colocated setup because the optimizer has to be offloaded at this point; in the colocated setup it has already been offloaded in refit_policy_generation.

llama3
colocated (baseline): 2 nodes.
non-colocated: 2 nodes for training and 1 node for inference.

[Figures: Convergence, Time Cost]

dsv3
colocated (baseline): 32 nodes.
non-colocated: 32 nodes for training and 8/16 nodes for inference.

[Figures: Convergence, Time Cost]

Issues

Closes #557.

Usage

Training resources are inferred from the overall resources and the inference resources,
i.e. training nodes = overall nodes - inference nodes.

# 1 node with 8 GPUs: 4 GPUs for training and 4 GPUs for inference
uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_1B_megatron.yaml \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.gpus_per_node=4 \
    cluster.num_nodes=1 \
    cluster.gpus_per_node=8

# 5 nodes with 8 GPUs each: 4 nodes for training and 1 node for inference
uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_1B_megatron.yaml \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.num_nodes=1 \
    cluster.num_nodes=5 \
    cluster.gpus_per_node=8
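
The two invocations above follow the rule "training = overall - inference". The sketch below works through that arithmetic for both cases. It is only an illustration: `ResourceSplit` and `split_resources` are hypothetical names, not functions from NeMo RL, and the single-node versus multi-node behavior is an assumption read off the two examples (GPUs are split within a node when there is one node, whole nodes are split otherwise).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceSplit:
    """Hypothetical container for the resulting train/inference resources."""

    train_nodes: int
    train_gpus_per_node: int
    inference_nodes: int
    inference_gpus_per_node: int


def split_resources(
    cluster_num_nodes: int,
    cluster_gpus_per_node: int,
    inference_num_nodes: Optional[int] = None,
    inference_gpus_per_node: Optional[int] = None,
) -> ResourceSplit:
    """Infer training resources as overall minus inference (illustrative only)."""
    if cluster_num_nodes == 1:
        # Assumed single-node case: inference takes some GPUs of the one node.
        if inference_gpus_per_node is None:
            raise ValueError("single-node split needs inference_gpus_per_node")
        train_gpus = cluster_gpus_per_node - inference_gpus_per_node
        if inference_gpus_per_node <= 0 or train_gpus <= 0:
            raise ValueError("both training and inference need at least one GPU")
        return ResourceSplit(1, train_gpus, 1, inference_gpus_per_node)

    # Assumed multi-node case: inference takes whole nodes.
    if inference_num_nodes is None:
        raise ValueError("multi-node split needs inference_num_nodes")
    train_nodes = cluster_num_nodes - inference_num_nodes
    if inference_num_nodes <= 0 or train_nodes <= 0:
        raise ValueError("both training and inference need at least one node")
    return ResourceSplit(
        train_nodes, cluster_gpus_per_node, inference_num_nodes, cluster_gpus_per_node
    )


if __name__ == "__main__":
    # 1 node with 8 GPUs, 4 of them reserved for inference.
    print(split_resources(1, 8, inference_gpus_per_node=4))
    # 5 nodes with 8 GPUs each, 1 node reserved for inference.
    print(split_resources(5, 8, inference_num_nodes=1))
```

With the values from the two commands, the first call gives 4 training GPUs on the shared node and the second gives 4 training nodes with 8 GPUs each.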

yuki-97 · 2025-07-07T12:13:05Z

I'd like to delay merging this PR until after v0.3 to reduce risk.
If you have time, though, please review it first so that I can make updates a bit earlier.

yuki-97 · 2025-08-05T07:28:56Z

~~Waiting until #766 is merged; need to rebase and also bump vllm to v0.10.0.~~
Done.

yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jul 4, 2025

yuki-97 temporarily deployed to nemo-ci July 4, 2025 09:14 — with GitHub Actions Inactive

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 4, 2025

yuki-97 temporarily deployed to nemo-ci July 4, 2025 12:47 — with GitHub Actions Inactive

yuki-97 force-pushed the yukih/non-colocated-mcore branch from e913e05 to 0ec3a7c July 4, 2025 13:47

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 4, 2025

yuki-97 temporarily deployed to nemo-ci July 4, 2025 13:48 — with GitHub Actions Inactive

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 4, 2025

yuki-97 temporarily deployed to nemo-ci July 4, 2025 14:33 — with GitHub Actions Inactive

yuki-97 force-pushed the yukih/non-colocated-mcore branch 2 times, most recently from d4bc832 to 5d1efb8 July 7, 2025 07:28

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 7, 2025

yuki-97 temporarily deployed to nemo-ci July 7, 2025 07:29 — with GitHub Actions Inactive

yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 7, 2025

yuki-97 temporarily deployed to nemo-ci July 7, 2025 09:50 — with GitHub Actions Inactive

yuki-97 marked this pull request as ready for review July 7, 2025 12:12

yuki-97 requested review from terrykong, parthchadha and SahilJain314 July 7, 2025 12:12

parthchadha reviewed Jul 7, 2025

nemo_rl/algorithms/grpo.py (outdated, resolved)

yuki-97 mentioned this pull request Jul 8, 2025

NCCL error when using non-colocated generation and set_model_state_dict apis #564

Closed

zhandaz mentioned this pull request Jul 10, 2025

fix: fix nccl P2P initialization error for non-colocated #636

Merged

yuki-97 mentioned this pull request Jul 10, 2025

feat: optimize refit by preparing refit info ahead of time #638

Merged

yuki-97 force-pushed the yukih/non-colocated-mcore branch from 5d1efb8 to 239322a July 23, 2025 07:44

yuki-97 added a commit that referenced this pull request Aug 5, 2025

squash #613

85f91c4

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 dismissed SahilJain314’s stale review via f9d50ff August 6, 2025 03:16

yuki-97 force-pushed the yukih/non-colocated-mcore branch 2 times, most recently from ed8f14b to 140810b August 6, 2025 12:33

yuki-97 added 6 commits August 7, 2025 13:35

support non-colocated in mcore worker

396cdea

add init_collective Signed-off-by: Yuki Huang <yukih@nvidia.com> add broadcast Signed-off-by: Yuki Huang <yukih@nvidia.com> update prepare_refit_info Signed-off-by: Yuki Huang <yukih@nvidia.com>

add vllm dependency in mcore

0485ea0

Signed-off-by: Yuki Huang <yukih@nvidia.com>

add unit test

d9b42ab

Signed-off-by: Yuki Huang <yukih@nvidia.com>

use PACK strategy for non-colocated

eaee9e1

Signed-off-by: Yuki Huang <yukih@nvidia.com>

add comments for using PACK

8c59134

Signed-off-by: Yuki Huang <yukih@nvidia.com>

typo

abf1679

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 force-pushed the yukih/non-colocated-mcore branch from 140810b to abf1679 August 7, 2025 13:37

terrykong previously approved these changes Aug 8, 2025

terrykong added this pull request to the merge queue Aug 8, 2025

github-merge-queue bot pushed a commit that referenced this pull request Aug 8, 2025

feat: support non-colocated in mcore (#613)

25bb5c9

Signed-off-by: Yuki Huang <yukih@nvidia.com>

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 8, 2025

fix unit test

a23a0f8

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 dismissed terrykong’s stale review via a23a0f8 August 8, 2025 03:06

terrykong enabled auto-merge August 8, 2025 03:12

terrykong approved these changes Aug 8, 2025

terrykong added this pull request to the merge queue Aug 8, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 8, 2025

terrykong added this pull request to the merge queue Aug 8, 2025

Merged via the queue into main with commit b8a89a9 Aug 8, 2025
19 checks passed

terrykong deleted the yukih/non-colocated-mcore branch August 8, 2025 10:23

guyueh1 pushed a commit that referenced this pull request Aug 11, 2025

feat: support non-colocated in mcore (#613)

d03c4c7

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Guyue Huang <guyueh@nvidia.com>

soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025

feat: support non-colocated in mcore (NVIDIA-NeMo#613)

30472d2

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>

jveronvialard pushed a commit that referenced this pull request Aug 27, 2025

feat: support non-colocated in mcore (#613)

200329f

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
