@terrykong terrykong commented Jul 14, 2025

When running multinode, the job fails unless the user knows to mount /opt/checkpoints on every node. This change introduces the following directory precedence:

1. **`NRL_MEGATRON_CHECKPOINT_DIR`** - Custom checkpoint directory path
2. [RECOMMENDED] **`HF_HOME/nemo_rl`** - Uses HuggingFace cache directory if available
3. **`~/.cache/huggingface/nemo_rl`** - Default fallback location

Most users already set HF_HOME, so this works seamlessly for those training large models on multiple nodes.

When neither variable is set, the default is now the cache directory rather than /opt/checkpoints, since /opt is usually reserved for "optional packages" as opposed to data mounts (which usually reside in /mnt).
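
For reviewers, here is a minimal sketch of the lookup order described above. It is not the code in this PR: the function name `resolve_megatron_checkpoint_dir` and the `env` parameter are hypothetical, introduced only for illustration, and the sketch assumes an empty environment variable should be treated as unset.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_megatron_checkpoint_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper illustrating the directory precedence in this PR."""
    # Allow a mapping to be injected so the lookup is easy to test.
    env = os.environ if env is None else env

    # 1. An explicit override takes priority.
    custom = env.get("NRL_MEGATRON_CHECKPOINT_DIR")
    if custom:
        return Path(custom)

    # 2. Otherwise reuse the Hugging Face cache root, if one is configured.
    hf_home = env.get("HF_HOME")
    if hf_home:
        return Path(hf_home) / "nemo_rl"

    # 3. Fall back to ~/.cache/huggingface/nemo_rl.
    return Path.home() / ".cache" / "huggingface" / "nemo_rl"
```

With HF_HOME pointing at storage that every node mounts, each node resolves the same path without any extra mount of /opt/checkpoints.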

Signed-off-by: Terry Kong <terryk@nvidia.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 14, 2025
@terrykong terrykong added the r0.3.0 Release r0.3.0 label Jul 14, 2025
@terrykong terrykong mentioned this pull request Jul 15, 2025
parthchadha previously approved these changes Jul 15, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
parthchadha previously approved these changes Jul 15, 2025
@jgerh jgerh left a comment

Completed the tech pubs review of docs/design-docs/training-backends.md and provided some copyedits for grammar and punctuation. Also revised the Configuration heading levels to fix "stacked" headings (one heading placed directly beneath another without any intervening body text), which tend to break the flow.

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
terrykong and others added 4 commits July 15, 2025 16:48
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong enabled auto-merge July 15, 2025 23:59
@terrykong terrykong added this pull request to the merge queue Jul 16, 2025
Merged via the queue into main with commit d158cbc Jul 16, 2025
13 of 14 checks passed
@terrykong terrykong deleted the tk/mcore-ckpt-dir branch July 16, 2025 09:38
ZhiyuLi-Nvidia pushed a commit that referenced this pull request Jul 21, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
jialei777 pushed a commit to jialei777/nemo-rl that referenced this pull request Jul 23, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Jialei Chen <jialeic@google.com>
KiddoZhu pushed a commit that referenced this pull request Jul 28, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
xxman-google pushed a commit to xxman-google/NeMo-RL that referenced this pull request Jul 30, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
FannYYW pushed a commit to xxman-google/NeMo-RL that referenced this pull request Aug 5, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>