@yuki-97 yuki-97 commented Jul 4, 2025

What does this PR do?

Support non-colocated generation (both sync and async) in the mcore worker.

Test Result

logprob_inference_prep is slower than in the colocated case because the optimizer has to be offloaded here; in the colocated case it has already been offloaded in refit_policy_generation.

  1. llama3
    colocated (baseline): 2 nodes.
    non-colocated: 2 nodes for training and 1 node for inference.
    [Figures: Convergence, Time Cost]
  2. dsv3
    colocated (baseline): 32 nodes.
    non-colocated: 32 nodes for training and 8/16 nodes for inference.
    [Figures: Convergence, Time Cost]

Issues

Closes #557.

Usage

Training resources are inferred from the overall cluster resources and the inference resources,
i.e. training nodes = overall nodes - inference nodes.

# 1 node with 8 GPUs, 4 GPUs for train and 4 GPUs for inference
uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_1B_megatron.yaml \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.gpus_per_node=4 \
    cluster.num_nodes=1 \
    cluster.gpus_per_node=8

# 5 nodes with 8 GPUs each, 4 nodes for train and 1 node for inference
uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_1B_megatron.yaml \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.num_nodes=1 \
    cluster.num_nodes=5 \
    cluster.gpus_per_node=8
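
The resource split used by the two commands above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Resources` and `split_resources` are hypothetical names, and the sketch assumes that a single-node cluster is split by GPUs while a multi-node cluster is split by whole nodes, with inference nodes using every GPU on the node.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Resources:
    """Hypothetical container for a node/GPU allocation."""
    num_nodes: int
    gpus_per_node: int


def split_resources(
    cluster: Resources,
    inference_num_nodes: Optional[int] = None,
    inference_gpus_per_node: Optional[int] = None,
) -> Tuple[Resources, Resources]:
    """Return (train, inference) resources; training gets whatever inference does not."""
    if cluster.num_nodes == 1:
        # Assumption: on a single node the split happens at GPU granularity.
        if inference_gpus_per_node is None:
            raise ValueError("single-node split needs inference_gpus_per_node")
        train_gpus = cluster.gpus_per_node - inference_gpus_per_node
        if inference_gpus_per_node <= 0 or train_gpus <= 0:
            raise ValueError("both training and inference need at least one GPU")
        return Resources(1, train_gpus), Resources(1, inference_gpus_per_node)

    # Assumption: with several nodes the split happens at node granularity.
    if inference_num_nodes is None:
        raise ValueError("multi-node split needs inference_num_nodes")
    train_nodes = cluster.num_nodes - inference_num_nodes
    if inference_num_nodes <= 0 or train_nodes <= 0:
        raise ValueError("both training and inference need at least one node")
    return (
        Resources(train_nodes, cluster.gpus_per_node),
        Resources(inference_num_nodes, cluster.gpus_per_node),
    )


# Mirrors the two usage examples: 1 node x 8 GPUs, then 5 nodes x 8 GPUs.
single_train, single_infer = split_resources(Resources(1, 8), inference_gpus_per_node=4)
multi_train, multi_infer = split_resources(Resources(5, 8), inference_num_nodes=1)
```

Under these assumptions the first call leaves 4 GPUs for training and 4 for inference, and the second leaves 4 nodes for training and 1 node for inference.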

@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Jul 4, 2025
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 4, 2025
@yuki-97 yuki-97 force-pushed the yukih/non-colocated-mcore branch from e913e05 to 0ec3a7c Compare July 4, 2025 13:47
@yuki-97 yuki-97 force-pushed the yukih/non-colocated-mcore branch 2 times, most recently from d4bc832 to 5d1efb8 Compare July 7, 2025 07:28
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jul 7, 2025
@yuki-97 yuki-97 marked this pull request as ready for review July 7, 2025 12:12
yuki-97 commented Jul 7, 2025

I'd like to delay merging this PR until after v0.3 to reduce risk.
If you have time, you can still review it now, so that I can make updates a bit earlier.

yuki-97 commented Aug 5, 2025

Waiting until #766 is merged; this PR then needs a rebase and a vLLM bump to v0.10.0.
Done.

yuki-97 added a commit that referenced this pull request Aug 5, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/non-colocated-mcore branch 2 times, most recently from ed8f14b to 140810b Compare August 6, 2025 12:33
yuki-97 added 6 commits August 7, 2025 13:35
add init_collective

Signed-off-by: Yuki Huang <yukih@nvidia.com>

add broadcast

Signed-off-by: Yuki Huang <yukih@nvidia.com>

update prepare_refit_info

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 force-pushed the yukih/non-colocated-mcore branch from 140810b to abf1679 Compare August 7, 2025 13:37
terrykong previously approved these changes Aug 8, 2025
@terrykong terrykong added this pull request to the merge queue Aug 8, 2025
github-merge-queue bot pushed a commit that referenced this pull request Aug 8, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 8, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@terrykong terrykong added this pull request to the merge queue Aug 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 8, 2025
@terrykong terrykong added this pull request to the merge queue Aug 8, 2025
Merged via the queue into main with commit b8a89a9 Aug 8, 2025
19 checks passed
@terrykong terrykong deleted the yukih/non-colocated-mcore branch August 8, 2025 10:23
guyueh1 pushed a commit that referenced this pull request Aug 11, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
youngeunkwon0405 added a commit to youngeunkwon0405/RL that referenced this pull request Aug 25, 2025
commit b246e55
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Mon Aug 25 15:05:48 2025 -0700

    update the script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 5315a6b
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Mon Aug 25 13:59:16 2025 -0700

    script update

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 4437402
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Tue Jul 15 17:42:23 2025 -0700

    local

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    wip

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    add script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    update script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    update script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    interactive

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit b721703
Author: Charlie Truong <chtruong@nvidia.com>
Date:   Mon Aug 18 11:22:54 2025 -0500

    build: Fix pytorch image ref in Dockerfile.ngc_pytorch (NVIDIA-NeMo#936)

    Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit 70b9666
Author: Charlie Truong <chtruong@nvidia.com>
Date:   Sun Aug 17 21:17:58 2025 -0500

    build: Add Dockerfile that uses NGC pytorch image (NVIDIA-NeMo#897)

    Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit df31c1b
Author: pjin-nvidia <pjin@nvidia.com>
Date:   Thu Aug 14 18:34:50 2025 -0700

    feat: chunked logprob calculation with deferred fp32 cast to help with OOM (NVIDIA-NeMo#918)

    Signed-off-by: Peter Jin <pjin@nvidia.com>

commit 83c6bfc
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Thu Aug 14 21:48:55 2025 +0800

    refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 9f7825e
Author: Rayen <130129397+RayenTian@users.noreply.github.com>
Date:   Thu Aug 14 12:38:27 2025 +0800

    feat: Add TP to embed_tokens and lm_head for Gemma models (NVIDIA-NeMo#879)

    Signed-off-by: ruit <ruit@nvidia.com>

commit e1f56c4
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Tue Aug 12 13:09:37 2025 -0700

    feat: add diagnostic script for problematic embeddings (NVIDIA-NeMo#896)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 223bfa8
Author: Gerald Shen <119401249+gshennvm@users.noreply.github.com>
Date:   Mon Aug 11 18:19:52 2025 -0700

    feat: add nemotron5 sharding (NVIDIA-NeMo#481)

    Signed-off-by: Terry Kong <terryk@nvidia.com>
    Co-authored-by: Terry Kong <terryk@nvidia.com>

commit 18b9e2c
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Mon Aug 11 15:08:52 2025 -0700

    test: lower step count on gemma nightly test to finish within 4 hours (NVIDIA-NeMo#880)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 8fd8c96
Author: guyueh1 <140554423+guyueh1@users.noreply.github.com>
Date:   Mon Aug 11 10:46:29 2025 -0700

    feat: Fix and enhances for Nsight system profiling (NVIDIA-NeMo#865)

    Signed-off-by: Guyue Huang <guyueh@nvidia.com>

commit 2b87def
Author: Qidong Su <soodoshll@gmail.com>
Date:   Fri Aug 8 18:54:20 2025 -0400

    fix: OOM in deepscaler1.5b with sequence length = 16/24k  (NVIDIA-NeMo#875)

    Signed-off-by: Qidong Su <qidongs@nvidia.com>

commit fecf71e
Author: Rayen <130129397+RayenTian@users.noreply.github.com>
Date:   Sat Aug 9 06:42:07 2025 +0800

    fix: remove tie weight check (NVIDIA-NeMo#700)

    Signed-off-by: ruit <ruit@nvidia.com>

commit d45ff3f
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Fri Aug 8 10:07:02 2025 -0700

    test: add deepscaler tests + pipe-clean configs + fix eval for deepscaler (NVIDIA-NeMo#866)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit d73c942
Author: Anna Shors <ashors@nvidia.com>
Date:   Fri Aug 8 09:27:15 2025 -0700

    feat: qwen3 export to HF (NVIDIA-NeMo#873)

    Signed-off-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>
    Signed-off-by: Anna Shors <ashors@nvidia.com>
    Co-authored-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>

commit e924d33
Author: Shang Wang <samshang.wang@mail.utoronto.ca>
Date:   Fri Aug 8 12:15:34 2025 -0400

    docs: Link uv's installation instructions to uv's website (NVIDIA-NeMo#837)

    Signed-off-by: Shang Wang <samshang.wang@mail.utoronto.ca>

commit bbbb3d6
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 23:26:15 2025 +0800

    fix: fix non-colocated with cpu_offload enabled (NVIDIA-NeMo#861)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 88a399e
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 14:04:08 2025 +0800

    chore: remove old fsdp1 unit test (NVIDIA-NeMo#871)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit b8a89a9
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 13:56:19 2025 +0800

    feat: support non-colocated in mcore (NVIDIA-NeMo#613)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 5910abb
Author: Anna Shors <ashors@nvidia.com>
Date:   Thu Aug 7 13:11:43 2025 -0700

    feat: support DTensor CP in DPO and SFT (NVIDIA-NeMo#798)

    Signed-off-by: ashors1 <ashors@nvidia.com>

commit 0988a7d
Author: Felipe Vieira Frujeri <ffrujeri@gmail.com>
Date:   Wed Aug 6 22:01:32 2025 -0700

    fix: Fix error message in VllmGenerationWorker. (NVIDIA-NeMo#633)

    Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

commit 233cc07
Author: Parth Chadha <pchadha@nvidia.com>
Date:   Wed Aug 6 15:14:22 2025 -0700

    fix: force use of eager (disabled cuda graphs) due to convergence issues (NVIDIA-NeMo#857)

    Signed-off-by: Parth Chadha <pchadha@nvidia.com>

commit 0557402
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Wed Aug 6 14:44:29 2025 -0700

    chore: 0.3.0 -> 0.4.0rc0 (NVIDIA-NeMo#840)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 03472a0
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Wed Aug 6 14:43:55 2025 -0700

    feat: dockerfile can build hermetically or from build context (NVIDIA-NeMo#799)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 9af0a52
Author: Anna Shors <ashors@nvidia.com>
Date:   Wed Aug 6 12:35:51 2025 -0700

    fix: fix grpo + mcore checkpointing without validation (NVIDIA-NeMo#844)

    Signed-off-by: ashors1 <ashors@nvidia.com>

commit b6269f7
Author: Yubo Gao <yubog@nvidia.com>
Date:   Tue Aug 5 16:55:02 2025 -0400

    feat: track policy training compute throughput (NVIDIA-NeMo#632)

    Signed-off-by: Yubo Gao <yubog@nvidia.com>

commit b74c5d0
Author: Wei Du <wedu@nvidia.com>
Date:   Tue Aug 5 15:05:13 2025 -0500

    feat: save checkpoint before timeout to avoid 4-hour runtime limit (NVIDIA-NeMo#734)

    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
    Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>

commit c784dd9
Author: Zhiyu Li <zhiyul@NVIDIA.com>
Date:   Tue Aug 5 10:47:30 2025 -0700

    feat: add data shuffle and random seed option (NVIDIA-NeMo#334)

    Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
    Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

commit c249efc
Author: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>
Date:   Tue Aug 5 21:33:28 2025 +0400

    docs: fix checkpointing command for megatron->hf export  (NVIDIA-NeMo#823)

    Signed-off-by: abdalgader-a <abdalgader.abubaker@tii.ae>

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Labels
CI:L1 Run doctests, unit tests, and functional tests
Development

Successfully merging this pull request may close these issues.

Support non-colocated in mcore worker
4 participants