Skip to content

Conversation

chtruong814
Copy link
Contributor

@chtruong814 chtruong814 commented Aug 12, 2025

What does this PR do ?

Add Dockerfile that uses NGC pytorch image

  • We need to build vllm from source to work with the pytorch in the container. This is installed with --no-deps to avoid overriding installed dependencies in the container such as torch and numpy.
  • Move the uv install and cache directories to outside of root home directory to prevent them from being overridden if the root home directory is bind mounted when running the container
  • Dependencies are installed into a single venv
  • Ray workers use the system python executable rather than their own venv. This is controlled by setting the env var NEMO_RL_PY_EXECUTABLES_SYSTEM=1 in the container.
  • uv sync is disabled to avoid unintentionally overriding installed dependencies
  • Set VLLM_USE_STANDALONE_COMPILE=0 because vllm 0.10.0 sets that by default and it detects the torch version in the 25.06 pytorch container version as 2.8.0. However, the NGC container is missing some required methods to work correctly with that env var set.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@chtruong814 chtruong814 requested a review from terrykong August 12, 2025 12:34
@chtruong814 chtruong814 added the CI:L1 Run doctests, unit tests, and functional tests label Aug 12, 2025
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: 95126e2 (PR #897 from chtruong/build-ngc-torch)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/chtruong814/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/
CURRENT (PR #897 from chtruong/build-ngc-torch): https://github.com/chtruong814/Megatron-LM/commits/ab433d69f50dd0fce828742431db1e5c7ac5a98d/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: 95126e2 (PR #897 from chtruong/build-ngc-torch)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/chtruong814/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/
CURRENT (PR #897 from chtruong/build-ngc-torch): https://github.com/chtruong814/Megatron-LM/commits/ab433d69f50dd0fce828742431db1e5c7ac5a98d/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
This reverts commit 6e38f2a.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@terrykong
Copy link
Contributor

Is this comment in the description still accurate?

Mcore ref is updated to a newer version 0.13.0 but includes the RL specific changes. We'll want to bump to a newer Mcore commit later

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@chtruong814
Copy link
Contributor Author

@terrykong thanks for the review. could you take another look when you get a chance?

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@terrykong terrykong enabled auto-merge August 16, 2025 00:41
@terrykong terrykong added this pull request to the merge queue Aug 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Aug 16, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Aug 17, 2025
@chtruong814 chtruong814 removed this pull request from the merge queue due to a manual request Aug 17, 2025
@chtruong814 chtruong814 added this pull request to the merge queue Aug 18, 2025
Merged via the queue into main with commit 70b9666 Aug 18, 2025
19 checks passed
@chtruong814 chtruong814 deleted the chtruong/build-ngc-torch branch August 18, 2025 06:46
zhandaz pushed a commit that referenced this pull request Aug 19, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Zhanda <zhandazhu@gmail.com>
@terrykong terrykong linked an issue Aug 20, 2025 that may be closed by this pull request
This was referenced Aug 20, 2025
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 20, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
chtruong814 added a commit that referenced this pull request Aug 21, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
youngeunkwon0405 added a commit to youngeunkwon0405/RL that referenced this pull request Aug 25, 2025
commit b246e55
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Mon Aug 25 15:05:48 2025 -0700

    update the script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 5315a6b
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Mon Aug 25 13:59:16 2025 -0700

    script update

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 4437402
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Tue Jul 15 17:42:23 2025 -0700

    local

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    wip

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    add script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    update script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    update script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    interactive

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit b721703
Author: Charlie Truong <chtruong@nvidia.com>
Date:   Mon Aug 18 11:22:54 2025 -0500

    build: Fix pytorch image ref in Dockerfile.ngc_pytorch (NVIDIA-NeMo#936)

    Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit 70b9666
Author: Charlie Truong <chtruong@nvidia.com>
Date:   Sun Aug 17 21:17:58 2025 -0500

    build: Add Dockerfile that uses NGC pytorch image (NVIDIA-NeMo#897)

    Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit df31c1b
Author: pjin-nvidia <pjin@nvidia.com>
Date:   Thu Aug 14 18:34:50 2025 -0700

    feat: chunked logprob calculation with deferred fp32 cast to help with OOM (NVIDIA-NeMo#918)

    Signed-off-by: Peter Jin <pjin@nvidia.com>

commit 83c6bfc
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Thu Aug 14 21:48:55 2025 +0800

    refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 9f7825e
Author: Rayen <130129397+RayenTian@users.noreply.github.com>
Date:   Thu Aug 14 12:38:27 2025 +0800

    feat: Add TP to embed_tokens and lm_head for Gemma models (NVIDIA-NeMo#879)

    Signed-off-by: ruit <ruit@nvidia.com>

commit e1f56c4
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Tue Aug 12 13:09:37 2025 -0700

    feat: add diagnostic script for problematic embeddings (NVIDIA-NeMo#896)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 223bfa8
Author: Gerald Shen <119401249+gshennvm@users.noreply.github.com>
Date:   Mon Aug 11 18:19:52 2025 -0700

    feat: add nemotron5 sharding (NVIDIA-NeMo#481)

    Signed-off-by: Terry Kong <terryk@nvidia.com>
    Co-authored-by: Terry Kong <terryk@nvidia.com>

commit 18b9e2c
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Mon Aug 11 15:08:52 2025 -0700

    test: lower step count on gemma nightly test to finish within 4 hours (NVIDIA-NeMo#880)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 8fd8c96
Author: guyueh1 <140554423+guyueh1@users.noreply.github.com>
Date:   Mon Aug 11 10:46:29 2025 -0700

    feat: Fix and enhances for Nsight system profiling (NVIDIA-NeMo#865)

    Signed-off-by: Guyue Huang <guyueh@nvidia.com>

commit 2b87def
Author: Qidong Su <soodoshll@gmail.com>
Date:   Fri Aug 8 18:54:20 2025 -0400

    fix: OOM in deepscaler1.5b with sequence length = 16/24k  (NVIDIA-NeMo#875)

    Signed-off-by: Qidong Su <qidongs@nvidia.com>

commit fecf71e
Author: Rayen <130129397+RayenTian@users.noreply.github.com>
Date:   Sat Aug 9 06:42:07 2025 +0800

    fix: remove tie weight check (NVIDIA-NeMo#700)

    Signed-off-by: ruit <ruit@nvidia.com>

commit d45ff3f
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Fri Aug 8 10:07:02 2025 -0700

    test: add deepscaler tests + pipe-clean configs + fix eval for deepscaler (NVIDIA-NeMo#866)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit d73c942
Author: Anna Shors <ashors@nvidia.com>
Date:   Fri Aug 8 09:27:15 2025 -0700

    feat: qwen3 export to HF (NVIDIA-NeMo#873)

    Signed-off-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>
    Signed-off-by: Anna Shors <ashors@nvidia.com>
    Co-authored-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>

commit e924d33
Author: Shang Wang <samshang.wang@mail.utoronto.ca>
Date:   Fri Aug 8 12:15:34 2025 -0400

    docs: Link uv's installation instructions to uv's website (NVIDIA-NeMo#837)

    Signed-off-by: Shang Wang <samshang.wang@mail.utoronto.ca>

commit bbbb3d6
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 23:26:15 2025 +0800

    fix: fix non-colocated with cpu_offload enabled (NVIDIA-NeMo#861)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 88a399e
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 14:04:08 2025 +0800

    chore: remove old fsdp1 unit test (NVIDIA-NeMo#871)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit b8a89a9
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 13:56:19 2025 +0800

    feat: support non-colocated in mcore (NVIDIA-NeMo#613)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 5910abb
Author: Anna Shors <ashors@nvidia.com>
Date:   Thu Aug 7 13:11:43 2025 -0700

    feat: support DTensor CP in DPO and SFT (NVIDIA-NeMo#798)

    Signed-off-by: ashors1 <ashors@nvidia.com>

commit 0988a7d
Author: Felipe Vieira Frujeri <ffrujeri@gmail.com>
Date:   Wed Aug 6 22:01:32 2025 -0700

    fix: Fix error message in VllmGenerationWorker. (NVIDIA-NeMo#633)

    Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

commit 233cc07
Author: Parth Chadha <pchadha@nvidia.com>
Date:   Wed Aug 6 15:14:22 2025 -0700

    fix: force use of eager (disabled cuda graphs) due to convergence issues (NVIDIA-NeMo#857)

    Signed-off-by: Parth Chadha <pchadha@nvidia.com>

commit 0557402
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Wed Aug 6 14:44:29 2025 -0700

    chore: 0.3.0 -> 0.4.0rc0 (NVIDIA-NeMo#840)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 03472a0
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Wed Aug 6 14:43:55 2025 -0700

    feat: dockerfile can build hermetically or from build context (NVIDIA-NeMo#799)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 9af0a52
Author: Anna Shors <ashors@nvidia.com>
Date:   Wed Aug 6 12:35:51 2025 -0700

    fix: fix grpo + mcore checkpointing without validation (NVIDIA-NeMo#844)

    Signed-off-by: ashors1 <ashors@nvidia.com>

commit b6269f7
Author: Yubo Gao <yubog@nvidia.com>
Date:   Tue Aug 5 16:55:02 2025 -0400

    feat: track policy training compute throughput (NVIDIA-NeMo#632)

    Signed-off-by: Yubo Gao <yubog@nvidia.com>

commit b74c5d0
Author: Wei Du <wedu@nvidia.com>
Date:   Tue Aug 5 15:05:13 2025 -0500

    feat: save checkpoint before timeout to avoid 4-hour runtime limit (NVIDIA-NeMo#734)

    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
    Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>

commit c784dd9
Author: Zhiyu Li <zhiyul@NVIDIA.com>
Date:   Tue Aug 5 10:47:30 2025 -0700

    feat: add data shuffle and random seed option (NVIDIA-NeMo#334)

    Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
    Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

commit c249efc
Author: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>
Date:   Tue Aug 5 21:33:28 2025 +0400

    docs: fix checkpointing command for megatron->hf export  (NVIDIA-NeMo#823)

    Signed-off-by: abdalgader-a <abdalgader.abubaker@tii.ae>

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI:L1 Run doctests, unit tests, and functional tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support NGC Pytorch containers
2 participants