feat: introduce megatron checkpoint dir precedence #665

terrykong · 2025-07-14T21:54:01Z

When running multinode, the job will fail unless the user knows to mount /opt/checkpoints on every node. This change introduces the following directory precedence:

1. **`NRL_MEGATRON_CHECKPOINT_DIR`** - Custom checkpoint directory path
2. [RECOMMENDED] **`HF_HOME/nemo_rl`** - Uses HuggingFace cache directory if available
3. **`~/.cache/huggingface/nemo_rl`** - Default fallback location
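
To make the lookup order concrete, here is a minimal sketch of how such a precedence could be resolved. The function name `resolve_megatron_checkpoint_dir` and the `env` parameter are hypothetical, introduced only for illustration. This is not necessarily how the change itself implements it.

```python
import os
from pathlib import Path


def resolve_megatron_checkpoint_dir(env=None):
    """Pick a checkpoint directory using the precedence listed above.

    Hypothetical helper: `env` defaults to os.environ and exists only so the
    lookup can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env

    # 1. An explicit override always wins.
    custom = env.get("NRL_MEGATRON_CHECKPOINT_DIR")
    if custom:
        return Path(custom)

    # 2. Otherwise use a nemo_rl subdirectory of the HuggingFace cache, if HF_HOME is set.
    hf_home = env.get("HF_HOME")
    if hf_home:
        return Path(hf_home) / "nemo_rl"

    # 3. Fall back to the default HuggingFace cache location in the home directory.
    return Path("~/.cache/huggingface/nemo_rl").expanduser()


if __name__ == "__main__":
    print(resolve_megatron_checkpoint_dir())
```

With this ordering, a shared `HF_HOME` that is already mounted on every node is enough for the directory to be visible everywhere, and `NRL_MEGATRON_CHECKPOINT_DIR` remains available as an explicit override.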

Most users will specify HF_HOME, so this will work seamlessly for those training large models on multiple nodes.

Otherwise, default to the cache dir, since /opt is usually reserved for "optional packages" rather than data mounts (which usually reside in /mnt).

Signed-off-by: Terry Kong <terryk@nvidia.com>

jgerh

Completed the tech pubs review of docs/design-docs/training-backends.md and provided some copyedits for grammar and punctuation. Also revised the Configuration heading levels to fix "stacked" headings (one heading placed directly beneath another without any intervening body text), which tend to break the flow.

docs/design-docs/training-backends.md

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>

Signed-off-by: Terry Kong <terryk@nvidia.com>

Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>

Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Jialei Chen <jialeic@google.com>

Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>

Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>

feat: introduce megatron checkpoint dir precedence

e5016a7

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong requested review from parthchadha and SahilJain314 July 14, 2025 21:54

github-actions bot added the documentation Improvements or additions to documentation label Jul 14, 2025

terrykong added the r0.3.0 Release r0.3.0 label Jul 14, 2025

terrykong mentioned this pull request Jul 15, 2025

No module named 'nemo' #662

Closed

parthchadha previously approved these changes Jul 15, 2025


add a note about input checkpoint formats

16c4862

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong dismissed parthchadha’s stale review via 16c4862 July 15, 2025 22:24

parthchadha previously approved these changes Jul 15, 2025


jgerh reviewed Jul 15, 2025


Apply suggestions from code review

5dc70d0

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>

terrykong dismissed parthchadha’s stale review via 5dc70d0 July 15, 2025 23:47

terrykong and others added 4 commits July 15, 2025 16:48

Apply suggestions from code review

b145765

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>

Update docs/design-docs/training-backends.md

724e827

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>

fixed

f384501

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix formatting

8f33d78

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong enabled auto-merge July 15, 2025 23:59

parthchadha approved these changes Jul 16, 2025


terrykong added this pull request to the merge queue Jul 16, 2025

Merged via the queue into main with commit d158cbc Jul 16, 2025
13 of 14 checks passed

terrykong deleted the tk/mcore-ckpt-dir branch July 16, 2025 09:38
