Skip to content

Conversation

pjin-nvidia
Copy link
Contributor

re-submission of #856 to work around CI issues

@wangshangsam @terrykong @chtruong814

Signed-off-by: Peter Jin <pjin@nvidia.com>
Based on NeMo commit: 8ddf4387344c6423763ec9ee0c9a755cbb5d8d35

Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
@pjin-nvidia pjin-nvidia added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Aug 14, 2025
Copy link

✅ Submodule Fast-Forward Check Results

Check based on commit: d746bbb (PR #918 from pjin/logprob)

✅ Submodules that are properly updated:

NeMo: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Copy link

✅ Submodule Fast-Forward Check Results

Check based on commit: d746bbb (PR #918 from pjin/logprob)

✅ Submodules that are properly updated:

NeMo: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Peter Jin <pjin@nvidia.com>
@pjin-nvidia pjin-nvidia removed the CI:L1 Run doctests, unit tests, and functional tests label Aug 14, 2025
Copy link

✅ Submodule Fast-Forward Check Results

Check based on commit: 5672258 (PR #918 from pjin/logprob)

✅ Submodules that are properly updated:

NeMo: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@pjin-nvidia pjin-nvidia added the CI:L1 Run doctests, unit tests, and functional tests label Aug 14, 2025
Copy link

✅ Submodule Fast-Forward Check Results

Check based on commit: 5672258 (PR #918 from pjin/logprob)

✅ Submodules that are properly updated:

NeMo: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Copy link

✅ Submodule Fast-Forward Check Results

Check based on commit: e5428b6 (PR #918 from pjin/logprob)

✅ Submodules that are properly updated:

NeMo: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@terrykong terrykong enabled auto-merge August 15, 2025 01:34
@terrykong terrykong added this pull request to the merge queue Aug 15, 2025
Merged via the queue into main with commit df31c1b Aug 15, 2025
19 checks passed
@terrykong terrykong deleted the pjin/logprob branch August 15, 2025 04:25
zhandaz pushed a commit that referenced this pull request Aug 19, 2025
…h OOM (#918)

Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Zhanda <zhandazhu@gmail.com>
youngeunkwon0405 added a commit to youngeunkwon0405/RL that referenced this pull request Aug 25, 2025
commit b246e55
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Mon Aug 25 15:05:48 2025 -0700

    update the script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 5315a6b
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Mon Aug 25 13:59:16 2025 -0700

    script update

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 4437402
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Tue Jul 15 17:42:23 2025 -0700

    local

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    wip

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    add script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    update script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    update script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    interactive

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit b721703
Author: Charlie Truong <chtruong@nvidia.com>
Date:   Mon Aug 18 11:22:54 2025 -0500

    build: Fix pytorch image ref in Dockerfile.ngc_pytorch (NVIDIA-NeMo#936)

    Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit 70b9666
Author: Charlie Truong <chtruong@nvidia.com>
Date:   Sun Aug 17 21:17:58 2025 -0500

    build: Add Dockerfile that uses NGC pytorch image (NVIDIA-NeMo#897)

    Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit df31c1b
Author: pjin-nvidia <pjin@nvidia.com>
Date:   Thu Aug 14 18:34:50 2025 -0700

    feat: chunked logprob calculation with deferred fp32 cast to help with OOM (NVIDIA-NeMo#918)

    Signed-off-by: Peter Jin <pjin@nvidia.com>

commit 83c6bfc
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Thu Aug 14 21:48:55 2025 +0800

    refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 9f7825e
Author: Rayen <130129397+RayenTian@users.noreply.github.com>
Date:   Thu Aug 14 12:38:27 2025 +0800

    feat: Add TP to embed_tokens and lm_head for Gemma models (NVIDIA-NeMo#879)

    Signed-off-by: ruit <ruit@nvidia.com>

commit e1f56c4
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Tue Aug 12 13:09:37 2025 -0700

    feat: add diagnostic script for problematic embeddings (NVIDIA-NeMo#896)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 223bfa8
Author: Gerald Shen <119401249+gshennvm@users.noreply.github.com>
Date:   Mon Aug 11 18:19:52 2025 -0700

    feat: add nemotron5 sharding (NVIDIA-NeMo#481)

    Signed-off-by: Terry Kong <terryk@nvidia.com>
    Co-authored-by: Terry Kong <terryk@nvidia.com>

commit 18b9e2c
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Mon Aug 11 15:08:52 2025 -0700

    test: lower step count on gemma nightly test to finish within 4 hours (NVIDIA-NeMo#880)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 8fd8c96
Author: guyueh1 <140554423+guyueh1@users.noreply.github.com>
Date:   Mon Aug 11 10:46:29 2025 -0700

    feat: Fix and enhances for Nsight system profiling (NVIDIA-NeMo#865)

    Signed-off-by: Guyue Huang <guyueh@nvidia.com>

commit 2b87def
Author: Qidong Su <soodoshll@gmail.com>
Date:   Fri Aug 8 18:54:20 2025 -0400

    fix: OOM in deepscaler1.5b with sequence length = 16/24k  (NVIDIA-NeMo#875)

    Signed-off-by: Qidong Su <qidongs@nvidia.com>

commit fecf71e
Author: Rayen <130129397+RayenTian@users.noreply.github.com>
Date:   Sat Aug 9 06:42:07 2025 +0800

    fix: remove tie weight check (NVIDIA-NeMo#700)

    Signed-off-by: ruit <ruit@nvidia.com>

commit d45ff3f
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Fri Aug 8 10:07:02 2025 -0700

    test: add deepscaler tests + pipe-clean configs + fix eval for deepscaler (NVIDIA-NeMo#866)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit d73c942
Author: Anna Shors <ashors@nvidia.com>
Date:   Fri Aug 8 09:27:15 2025 -0700

    feat: qwen3 export to HF (NVIDIA-NeMo#873)

    Signed-off-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>
    Signed-off-by: Anna Shors <ashors@nvidia.com>
    Co-authored-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>

commit e924d33
Author: Shang Wang <samshang.wang@mail.utoronto.ca>
Date:   Fri Aug 8 12:15:34 2025 -0400

    docs: Link uv's installation instructions to uv's website (NVIDIA-NeMo#837)

    Signed-off-by: Shang Wang <samshang.wang@mail.utoronto.ca>

commit bbbb3d6
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 23:26:15 2025 +0800

    fix: fix non-colocated with cpu_offload enabled (NVIDIA-NeMo#861)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 88a399e
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 14:04:08 2025 +0800

    chore: remove old fsdp1 unit test (NVIDIA-NeMo#871)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit b8a89a9
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 13:56:19 2025 +0800

    feat: support non-colocated in mcore (NVIDIA-NeMo#613)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 5910abb
Author: Anna Shors <ashors@nvidia.com>
Date:   Thu Aug 7 13:11:43 2025 -0700

    feat: support DTensor CP in DPO and SFT (NVIDIA-NeMo#798)

    Signed-off-by: ashors1 <ashors@nvidia.com>

commit 0988a7d
Author: Felipe Vieira Frujeri <ffrujeri@gmail.com>
Date:   Wed Aug 6 22:01:32 2025 -0700

    fix: Fix error message in VllmGenerationWorker. (NVIDIA-NeMo#633)

    Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

commit 233cc07
Author: Parth Chadha <pchadha@nvidia.com>
Date:   Wed Aug 6 15:14:22 2025 -0700

    fix: force use of eager (disabled cuda graphs) due to convergence issues (NVIDIA-NeMo#857)

    Signed-off-by: Parth Chadha <pchadha@nvidia.com>

commit 0557402
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Wed Aug 6 14:44:29 2025 -0700

    chore: 0.3.0 -> 0.4.0rc0 (NVIDIA-NeMo#840)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 03472a0
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Wed Aug 6 14:43:55 2025 -0700

    feat: dockerfile can build hermetically or from build context (NVIDIA-NeMo#799)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 9af0a52
Author: Anna Shors <ashors@nvidia.com>
Date:   Wed Aug 6 12:35:51 2025 -0700

    fix: fix grpo + mcore checkpointing without validation (NVIDIA-NeMo#844)

    Signed-off-by: ashors1 <ashors@nvidia.com>

commit b6269f7
Author: Yubo Gao <yubog@nvidia.com>
Date:   Tue Aug 5 16:55:02 2025 -0400

    feat: track policy training compute throughput (NVIDIA-NeMo#632)

    Signed-off-by: Yubo Gao <yubog@nvidia.com>

commit b74c5d0
Author: Wei Du <wedu@nvidia.com>
Date:   Tue Aug 5 15:05:13 2025 -0500

    feat: save checkpoint before timeout to avoid 4-hour runtime limit (NVIDIA-NeMo#734)

    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
    Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>

commit c784dd9
Author: Zhiyu Li <zhiyul@NVIDIA.com>
Date:   Tue Aug 5 10:47:30 2025 -0700

    feat: add data shuffle and random seed option (NVIDIA-NeMo#334)

    Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
    Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

commit c249efc
Author: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>
Date:   Tue Aug 5 21:33:28 2025 +0400

    docs: fix checkpointing command for megatron->hf export  (NVIDIA-NeMo#823)

    Signed-off-by: abdalgader-a <abdalgader.abubaker@tii.ae>

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
…h OOM (#918)

Signed-off-by: Peter Jin <pjin@nvidia.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI:L1 Run doctests, unit tests, and functional tests CI Relating to CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants