build: Add Dockerfile that uses NGC pytorch image #897

chtruong814 · 2025-08-12T12:34:44Z

What does this PR do?

Add Dockerfile that uses NGC pytorch image

vllm must be built from source to work with the pytorch already installed in the container. It is installed with --no-deps so that it does not override dependencies already in the container, such as torch and numpy.
The uv install and cache directories are moved outside of the root home directory so that they are not overridden if the root home directory is bind mounted when running the container.
Dependencies are installed into a single venv.
Ray workers use the system python executable rather than their own venv. This is controlled by setting the env var NEMO_RL_PY_EXECUTABLES_SYSTEM=1 in the container.
uv sync is disabled to avoid unintentionally overriding installed dependencies.
VLLM_USE_STANDALONE_COMPILE=0 is set because vllm 0.10.0 enables that variable by default, and it detects the torch version in the 25.06 pytorch container as 2.8.0. However, the NGC container is missing some methods required to work correctly with that env var set.
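
To illustrate the NEMO_RL_PY_EXECUTABLES_SYSTEM switch described above, here is a minimal sketch of how a worker launcher could choose its interpreter. The function name `resolve_py_executable` and the `venv_python` parameter are hypothetical and are not taken from the NeMo RL code base; `sys.executable` stands in for "the system python executable" as an assumption.

```python
import os
import sys
from typing import Mapping, Optional


def resolve_py_executable(
    venv_python: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Pick the interpreter a worker should run with (hypothetical helper).

    If NEMO_RL_PY_EXECUTABLES_SYSTEM is "1", return the interpreter that is
    already running (assumed here to be the container's system python).
    Otherwise return the per-worker venv interpreter passed in by the caller.
    """
    env = os.environ if env is None else env
    if env.get("NEMO_RL_PY_EXECUTABLES_SYSTEM") == "1":
        return sys.executable
    return venv_python


if __name__ == "__main__":
    # Inside the NGC-based container the env var is set to 1, so this would
    # print the container's own interpreter path.
    print(resolve_py_executable("/path/to/worker/venv/bin/python"))
```

Passing `env` explicitly keeps the helper easy to test without mutating the real process environment.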

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

github-actions · 2025-08-12T12:35:56Z

❌ Submodule Fast-Forward Check Failed

Check based on commit: 95126e2 (PR #897 from chtruong/build-ngc-torch)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/chtruong814/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/
CURRENT (PR #897 from chtruong/build-ngc-torch): https://github.com/chtruong814/Megatron-LM/commits/ab433d69f50dd0fce828742431db1e5c7ac5a98d/

Please ensure all submodule commits are fast-forwards of the main branch before merging.
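
For readers unfamiliar with the terms in this check, the sketch below classifies a submodule pointer as a fast-forward or as diverged using a toy in-memory commit graph. It is an illustration under stated assumptions, not the CI job's actual implementation; the commit names and the `classify` helper are hypothetical. Against a real repository, `git merge-base --is-ancestor <target> <current>` answers the same ancestry question.

```python
from typing import Dict, List, Set


def ancestors(parents: Dict[str, List[str]], commit: str) -> Set[str]:
    """Return the commit itself plus every commit reachable via parent links."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        stack.extend(parents.get(c, []))
    return seen


def classify(parents: Dict[str, List[str]], target: str, current: str) -> str:
    """Describe how `current` (PR) relates to `target` (main branch)."""
    if target == current:
        return "up-to-date"
    current_anc = ancestors(parents, current)
    if target in current_anc:
        return "fast-forward"  # PR commit builds directly on top of main
    target_anc = ancestors(parents, target)
    if current in target_anc:
        return "behind"  # PR commit is older than main
    if target_anc & current_anc:
        return "diverged"  # shared history, but neither contains the other
    return "unrelated"


# Toy history: "main_tip" and "pr_tip" both branch off "base".
history = {
    "base": [],
    "main_tip": ["base"],
    "pr_tip": ["base"],
    "pr_rebased": ["main_tip"],
}
print(classify(history, "main_tip", "pr_tip"))
print(classify(history, "main_tip", "pr_rebased"))
```

In this toy history the first call corresponds to the DIVERGED state reported above, and the second to the state the check asks for.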

github-actions · 2025-08-12T12:37:14Z

❌ Submodule Fast-Forward Check Failed

Check based on commit: 95126e2 (PR #897 from chtruong/build-ngc-torch)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/chtruong814/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/
CURRENT (PR #897 from chtruong/build-ngc-torch): https://github.com/chtruong814/Megatron-LM/commits/ab433d69f50dd0fce828742431db1e5c7ac5a98d/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

terrykong · 2025-08-13T05:13:56Z

Is this comment in the description still accurate?

Mcore ref is updated to a newer version 0.13.0 but includes the RL specific changes. We'll want to bump to a newer Mcore commit later

chtruong814 · 2025-08-14T01:38:38Z

@terrykong thanks for the review. could you take another look when you get a chance?

commit b246e55 Author: Youngeun Kwon <youngeunk@nvidia.com> Date: Mon Aug 25 15:05:48 2025 -0700 update the script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
commit 5315a6b Author: Youngeun Kwon <youngeunk@nvidia.com> Date: Mon Aug 25 13:59:16 2025 -0700 script update Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
commit 4437402 Author: Youngeun Kwon <youngeunk@nvidia.com> Date: Tue Jul 15 17:42:23 2025 -0700 local Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> wip Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> add script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> update script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> update script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> interactive Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
commit b721703 Author: Charlie Truong <chtruong@nvidia.com> Date: Mon Aug 18 11:22:54 2025 -0500 build: Fix pytorch image ref in Dockerfile.ngc_pytorch (NVIDIA-NeMo#936) Signed-off-by: Charlie Truong <chtruong@nvidia.com>
commit 70b9666 Author: Charlie Truong <chtruong@nvidia.com> Date: Sun Aug 17 21:17:58 2025 -0500 build: Add Dockerfile that uses NGC pytorch image (NVIDIA-NeMo#897) Signed-off-by: Charlie Truong <chtruong@nvidia.com>
commit df31c1b Author: pjin-nvidia <pjin@nvidia.com> Date: Thu Aug 14 18:34:50 2025 -0700 feat: chunked logprob calculation with deferred fp32 cast to help with OOM (NVIDIA-NeMo#918) Signed-off-by: Peter Jin <pjin@nvidia.com>
commit 83c6bfc Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Thu Aug 14 21:48:55 2025 +0800 refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900) Signed-off-by: Yuki Huang <yukih@nvidia.com>
commit 9f7825e Author: Rayen <130129397+RayenTian@users.noreply.github.com> Date: Thu Aug 14 12:38:27 2025 +0800 feat: Add TP to embed_tokens and lm_head for Gemma models (NVIDIA-NeMo#879) Signed-off-by: ruit <ruit@nvidia.com>
commit e1f56c4 Author: Terry Kong <terrycurtiskong@gmail.com> Date: Tue Aug 12 13:09:37 2025 -0700 feat: add diagnostic script for problematic embeddings (NVIDIA-NeMo#896) Signed-off-by: Terry Kong <terryk@nvidia.com>
commit 223bfa8 Author: Gerald Shen <119401249+gshennvm@users.noreply.github.com> Date: Mon Aug 11 18:19:52 2025 -0700 feat: add nemotron5 sharding (NVIDIA-NeMo#481) Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com>
commit 18b9e2c Author: Terry Kong <terrycurtiskong@gmail.com> Date: Mon Aug 11 15:08:52 2025 -0700 test: lower step count on gemma nightly test to finish within 4 hours (NVIDIA-NeMo#880) Signed-off-by: Terry Kong <terryk@nvidia.com>
commit 8fd8c96 Author: guyueh1 <140554423+guyueh1@users.noreply.github.com> Date: Mon Aug 11 10:46:29 2025 -0700 feat: Fix and enhances for Nsight system profiling (NVIDIA-NeMo#865) Signed-off-by: Guyue Huang <guyueh@nvidia.com>
commit 2b87def Author: Qidong Su <soodoshll@gmail.com> Date: Fri Aug 8 18:54:20 2025 -0400 fix: OOM in deepscaler1.5b with sequence length = 16/24k (NVIDIA-NeMo#875) Signed-off-by: Qidong Su <qidongs@nvidia.com>
commit fecf71e Author: Rayen <130129397+RayenTian@users.noreply.github.com> Date: Sat Aug 9 06:42:07 2025 +0800 fix: remove tie weight check (NVIDIA-NeMo#700) Signed-off-by: ruit <ruit@nvidia.com>
commit d45ff3f Author: Terry Kong <terrycurtiskong@gmail.com> Date: Fri Aug 8 10:07:02 2025 -0700 test: add deepscaler tests + pipe-clean configs + fix eval for deepscaler (NVIDIA-NeMo#866) Signed-off-by: Terry Kong <terryk@nvidia.com>
commit d73c942 Author: Anna Shors <ashors@nvidia.com> Date: Fri Aug 8 09:27:15 2025 -0700 feat: qwen3 export to HF (NVIDIA-NeMo#873) Signed-off-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com> Signed-off-by: Anna Shors <ashors@nvidia.com> Co-authored-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>
commit e924d33 Author: Shang Wang <samshang.wang@mail.utoronto.ca> Date: Fri Aug 8 12:15:34 2025 -0400 docs: Link uv's installation instructions to uv's website (NVIDIA-NeMo#837) Signed-off-by: Shang Wang <samshang.wang@mail.utoronto.ca>
commit bbbb3d6 Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Fri Aug 8 23:26:15 2025 +0800 fix: fix non-colocated with cpu_offload enabled (NVIDIA-NeMo#861) Signed-off-by: Yuki Huang <yukih@nvidia.com>
commit 88a399e Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Fri Aug 8 14:04:08 2025 +0800 chore: remove old fsdp1 unit test (NVIDIA-NeMo#871) Signed-off-by: Yuki Huang <yukih@nvidia.com>
commit b8a89a9 Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Fri Aug 8 13:56:19 2025 +0800 feat: support non-colocated in mcore (NVIDIA-NeMo#613) Signed-off-by: Yuki Huang <yukih@nvidia.com>
commit 5910abb Author: Anna Shors <ashors@nvidia.com> Date: Thu Aug 7 13:11:43 2025 -0700 feat: support DTensor CP in DPO and SFT (NVIDIA-NeMo#798) Signed-off-by: ashors1 <ashors@nvidia.com>
commit 0988a7d Author: Felipe Vieira Frujeri <ffrujeri@gmail.com> Date: Wed Aug 6 22:01:32 2025 -0700 fix: Fix error message in VllmGenerationWorker. (NVIDIA-NeMo#633) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
commit 233cc07 Author: Parth Chadha <pchadha@nvidia.com> Date: Wed Aug 6 15:14:22 2025 -0700 fix: force use of eager (disabled cuda graphs) due to convergence issues (NVIDIA-NeMo#857) Signed-off-by: Parth Chadha <pchadha@nvidia.com>
commit 0557402 Author: Terry Kong <terrycurtiskong@gmail.com> Date: Wed Aug 6 14:44:29 2025 -0700 chore: 0.3.0 -> 0.4.0rc0 (NVIDIA-NeMo#840) Signed-off-by: Terry Kong <terryk@nvidia.com>
commit 03472a0 Author: Terry Kong <terrycurtiskong@gmail.com> Date: Wed Aug 6 14:43:55 2025 -0700 feat: dockerfile can build hermetically or from build context (NVIDIA-NeMo#799) Signed-off-by: Terry Kong <terryk@nvidia.com>
commit 9af0a52 Author: Anna Shors <ashors@nvidia.com> Date: Wed Aug 6 12:35:51 2025 -0700 fix: fix grpo + mcore checkpointing without validation (NVIDIA-NeMo#844) Signed-off-by: ashors1 <ashors@nvidia.com>
commit b6269f7 Author: Yubo Gao <yubog@nvidia.com> Date: Tue Aug 5 16:55:02 2025 -0400 feat: track policy training compute throughput (NVIDIA-NeMo#632) Signed-off-by: Yubo Gao <yubog@nvidia.com>
commit b74c5d0 Author: Wei Du <wedu@nvidia.com> Date: Tue Aug 5 15:05:13 2025 -0500 feat: save checkpoint before timeout to avoid 4-hour runtime limit (NVIDIA-NeMo#734) Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
commit c784dd9 Author: Zhiyu Li <zhiyul@NVIDIA.com> Date: Tue Aug 5 10:47:30 2025 -0700 feat: add data shuffle and random seed option (NVIDIA-NeMo#334) Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
commit c249efc Author: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com> Date: Tue Aug 5 21:33:28 2025 +0400 docs: fix checkpointing command for megatron->hf export (NVIDIA-NeMo#823) Signed-off-by: abdalgader-a <abdalgader.abubaker@tii.ae> Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

chtruong814 requested a review from terrykong August 12, 2025 12:34

chtruong814 added the CI:L1 Run doctests, unit tests, and functional tests label Aug 12, 2025

chtruong814 had a problem deploying to nemo-ci August 12, 2025 12:35 — with GitHub Actions Error

chtruong814 added 3 commits August 12, 2025 07:36

Optionally build RL with ngc torch container

8494039

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Do not install TE when using pytorch image

29a9732

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Exclude transformer-engine-cu-12 when using pytorch container

c5a07b3

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added 22 commits August 12, 2025 07:37

Do not install custom python in ngc torch container

9ff3a6b

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Add no-install-pytorch-deps

ced97c4

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Use system executable

fc49eb3

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Update mcore ref to use 0.13.0 fork with sahil cherry-picks

d711bc3

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Revert original Dockerfile

51fe21c

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Revert "Revert original Dockerfile"

bbee575

This reverts commit 6e38f2a. Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Build vllm with uv

32e09ff

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix vllm build in uv

b452cc0

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix vllm output directory

9dba4a7

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Revert docker container

ddf4d3c

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Add ngc pytorch Dockerfile

0c73b13

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Use system executable given NEMO_RL_PY_EXECUTABLES_SYSTEM

26b42a0

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Ensure uv is installed

45c9f9a

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix vllm build

a15ac1c

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix vllm install

bed0c3d

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix no install env var

534e426

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix uv install

f0ad336

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Ensure numpy is not upgraded during install

d5517cf

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix ngc override location

150b17f

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Do not use uv to build vllm

653a39b

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix vllm build

b17c3c6

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Build vllm with pip

aed522f

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 temporarily deployed to nemo-ci August 13, 2025 01:30 — with GitHub Actions Inactive

chtruong814 temporarily deployed to nemo-ci August 13, 2025 02:46 — with GitHub Actions Inactive

terrykong reviewed Aug 13, 2025

docker/Dockerfile.ngc_pytorch

chtruong814 added 3 commits August 13, 2025 20:23

Update Dockerfile based on feedback

4e07ced

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Remove unused file from .gitignore

0096d65

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Revert Dockerfile

b734d24

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added 2 commits August 14, 2025 17:21

Merge remote-tracking branch 'origin/main' into chtruong/build-ngc-torch

aba3b85

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Add comment around installing new dependencies

bfe5610

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

terrykong approved these changes Aug 16, 2025

terrykong enabled auto-merge August 16, 2025 00:41

terrykong added this pull request to the merge queue Aug 16, 2025

github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Aug 16, 2025

chtruong814 added this pull request to the merge queue Aug 17, 2025

chtruong814 removed this pull request from the merge queue due to a manual request Aug 17, 2025

chtruong814 added this pull request to the merge queue Aug 18, 2025

Merged via the queue into main with commit 70b9666 Aug 18, 2025
19 checks passed

chtruong814 deleted the chtruong/build-ngc-torch branch August 18, 2025 06:46

zhandaz pushed a commit that referenced this pull request Aug 19, 2025

build: Add Dockerfile that uses NGC pytorch image (#897)

858d0fe

Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Zhanda <zhandazhu@gmail.com>

terrykong linked an issue Aug 20, 2025 that may be closed by this pull request

Support NGC Pytorch containers #741

Closed

This was referenced Aug 20, 2025

B200/GB200 on NGC pytorch #952

Closed

Support NGC Pytorch containers #741

Closed

soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 20, 2025

build: Add Dockerfile that uses NGC pytorch image (NVIDIA-NeMo#897)

d93d050

Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>

chtruong814 added a commit that referenced this pull request Aug 21, 2025

build: Add Dockerfile that uses NGC pytorch image (#897)

98e359f

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

jveronvialard pushed a commit that referenced this pull request Aug 27, 2025

build: Add Dockerfile that uses NGC pytorch image (#897)

c1ec99e

Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
