
Conversation

wedu-nvidia
Contributor

…g time lmit

What does this PR do ?

Since the server automatically stops after 4 hours, it's recommended to save a checkpoint beforehand. For example, set the timeout to 3 hours and 45 minutes to ensure the checkpoint is saved in time.
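
A minimal sketch of the idea, not the code merged in this PR: the training loop checks a wall-clock budget after every step and saves a checkpoint once the budget is spent. `DeadlineChecker`, `hms_to_seconds`, and `train` are hypothetical names introduced only for illustration, and stopping right after the save is an assumption of the sketch.

```python
import time


class DeadlineChecker:
    """Hypothetical helper: reports when a wall-clock budget has been used up."""

    def __init__(self, budget_seconds: float, clock=time.monotonic) -> None:
        self._clock = clock
        self._deadline = clock() + budget_seconds

    def expired(self) -> bool:
        return self._clock() >= self._deadline


def hms_to_seconds(hours: int, minutes: int = 0, seconds: int = 0) -> int:
    """Convert an hours/minutes/seconds budget into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def train(num_steps, checker, save_checkpoint):
    """Run up to num_steps; checkpoint and stop once the budget is spent."""
    for step in range(num_steps):
        # ... one training step would run here ...
        if checker.expired():
            save_checkpoint(step)
            return step
    return None
```

With a 4-hour cutoff, `DeadlineChecker(hms_to_seconds(3, 45))` leaves 15 minutes of slack for the save itself. The check only runs between steps, so the slack should be longer than one step plus the time a checkpoint takes to write.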

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@wedu-nvidia wedu-nvidia changed the title from "save checking point before timeout to deal with 4 hour limit" to "feat: save checkpoint before timeout to avoid 4-hour runtime limit" Jul 24, 2025
Contributor

@terrykong terrykong left a comment


thanks for contributing. this would definitely be a valuable feature to have. I've left some comments

@wedu-nvidia
Contributor Author

@terrykong I revised the PR based on your suggestions. Let me know if you have more comments.

@terrykong
Contributor

@wedu-nvidia could you address the DCO failure and run the pre-commit hooks? See https://github.com/NVIDIA-NeMo/RL/blob/main/CONTRIBUTING.md

@wedu-nvidia wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from d106c44 to b3c7f82 Compare July 30, 2025 15:18
@github-actions github-actions bot added the documentation and CI labels Jul 30, 2025
@wedu-nvidia wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from b3c7f82 to 33597d1 Compare July 30, 2025 15:21
@github-actions github-actions bot removed the documentation and CI labels Jul 30, 2025
@wedu-nvidia wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from 33597d1 to b3c7f82 Compare July 30, 2025 15:26
@github-actions github-actions bot added the documentation and CI labels Jul 30, 2025
@wedu-nvidia wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch 2 times, most recently from 9cfa5a7 to b242f32 Compare July 30, 2025 20:48
@github-actions github-actions bot removed the documentation and CI labels Jul 30, 2025
…g time lmit

Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
@wedu-nvidia wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from 831926d to e51cad2 Compare July 30, 2025 21:01
Signed-off-by: Wei Du <wedu@nvidia.com>
@wedu-nvidia wedu-nvidia force-pushed the wedu/timeout-save-checkpoint branch from e51cad2 to 421d34c Compare July 30, 2025 21:03
@wedu-nvidia
Contributor Author

Hi @terrykong, all DCO and pre-commit issues have been resolved, and the commits are now properly signed.

Please help approve the pending workflows and review the change request when convenient — thanks!

@terrykong
Contributor

@wedu-nvidia looks like there are still some failures, this time with pyrefly:

ERROR `float` is not assignable to attribute `previous_iteration_time` with type `None` [bad-assignment]
   --> /home/runner/work/RL/RL/nemo_rl/utils/timer.py:311:40
    |
311 |         self.previous_iteration_time = time.time()
    |                                        ^^^^^^^^^^^
    |
ERROR `-` is not supported between `float` and `None` [bad-argument-type]
   --> /home/runner/work/RL/RL/nemo_rl/utils/timer.py:318:24
    |
318 |         elapsed_time = current_time - self.previous_iteration_time
    |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
  Argument `None` is not assignable to parameter `value` with type `float` in function `float.__sub__`
ERROR `float` is not assignable to attribute `previous_iteration_time` with type `None` [bad-assignment]
   --> /home/runner/work/RL/RL/nemo_rl/utils/timer.py:319:40
    |
319 |         self.previous_iteration_time = current_time
    |                                        ^^^^^^^^^^^^
    |
 INFO errors shown: 3, errors ignored: 24, modules: 82, transitive dependencies: 4,273, lines: 2,048,943, time: 4.08s, peak memory: physical 786.5 MiB
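
All three errors point at the same attribute, whose type the checker has inferred as `None`. This is most likely because it is initialised to `None` without an annotation. One way to resolve that, sketched on a simplified stand-in class (`IterationTimer` and its method names are hypothetical, not the class in `nemo_rl/utils/timer.py`), is to annotate the attribute as `float | None` and narrow it before subtracting:

```python
import time


class IterationTimer:
    """Hypothetical stand-in that measures the time between iterations."""

    def __init__(self) -> None:
        # Explicit annotation: without it, a checker may infer the type `None`
        # and reject every later float assignment.
        self.previous_iteration_time: float | None = None

    def start_iterations(self) -> None:
        self.previous_iteration_time = time.time()

    def mark_iteration(self) -> float:
        # Narrow `float | None` to `float` so the subtraction type-checks.
        if self.previous_iteration_time is None:
            raise RuntimeError("call start_iterations() first")
        current_time = time.time()
        elapsed_time = current_time - self.previous_iteration_time
        self.previous_iteration_time = current_time
        return elapsed_time
```

Raising on the unset case is a design choice of this sketch. Returning `0.0` or starting the timer lazily would satisfy the checker equally well.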

@wedu-nvidia
Contributor Author

@terrykong May I know the current status of this?

@terrykong
Contributor

This one looks to have a legitimate unit test failure:

FAILED unit/algorithms/test_sft.py::test_exit_on_max_steps - KeyError: 'save_...

Once you've resolved it, we can retry.

Signed-off-by: Wei Du <wedu@nvidia.com>
@wedu-nvidia
Contributor Author

@terrykong I added another parameter; hopefully all checks pass now.

@terrykong terrykong enabled auto-merge August 4, 2025 18:10
terrykong previously approved these changes Aug 4, 2025
@terrykong terrykong added this pull request to the merge queue Aug 4, 2025
Signed-off-by: Wei Du <wedu@nvidia.com>
auto-merge was automatically disabled August 4, 2025 19:10

Head branch was pushed to by a user without write access

@wedu-nvidia
Contributor Author

wedu-nvidia commented Aug 4, 2025

@terrykong The previous error seems to be resolved. I then saw another error, so I added
checkpoint_must_save_by: NotRequired[str | None]
to the following config:


class CheckpointingConfig(TypedDict):
    """Configuration for checkpoint management.

    Attributes:
    enabled (bool): Whether checkpointing is enabled.
    checkpoint_dir (PathLike): Directory where checkpoints will be saved.
    metric_name (str): Name of the metric to use for determining best checkpoints.
    higher_is_better (bool): Whether higher values of the metric indicate better performance.
    keep_top_k (Optional[int]): Number of best checkpoints to keep. If None, all checkpoints are kept.
    """

    enabled: bool
    checkpoint_dir: PathLike
    metric_name: str
    higher_is_better: bool
    save_period: int
    keep_top_k: NotRequired[int]
    checkpoint_must_save_by: NotRequired[str | None]

Signed-off-by: Wei Du <wedu@nvidia.com>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 4, 2025
@wedu-nvidia
Contributor Author

@terrykong Can you help add it to the merge queue again? Thanks so much.

@terrykong terrykong enabled auto-merge August 5, 2025 07:24
@wedu-nvidia
Contributor Author

@terrykong can you put it into the merge queue?

@terrykong terrykong added this pull request to the merge queue Aug 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Aug 5, 2025
@wedu-nvidia
Contributor Author

@terrykong Why don't I see the conflict?

@terrykong terrykong enabled auto-merge August 5, 2025 20:03
@terrykong terrykong added this pull request to the merge queue Aug 5, 2025
Merged via the queue into NVIDIA-NeMo:main with commit b74c5d0 Aug 6, 2025
19 checks passed
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 13, 2025
…VIDIA-NeMo#734)

Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
youngeunkwon0405 added a commit to youngeunkwon0405/RL that referenced this pull request Aug 25, 2025
commit b246e55
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Mon Aug 25 15:05:48 2025 -0700

    update the script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 5315a6b
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Mon Aug 25 13:59:16 2025 -0700

    script update

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit 4437402
Author: Youngeun Kwon <youngeunk@nvidia.com>
Date:   Tue Jul 15 17:42:23 2025 -0700

    local

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    wip

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    add script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    update script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    update script

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

    interactive

    Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

commit b721703
Author: Charlie Truong <chtruong@nvidia.com>
Date:   Mon Aug 18 11:22:54 2025 -0500

    build: Fix pytorch image ref in Dockerfile.ngc_pytorch (NVIDIA-NeMo#936)

    Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit 70b9666
Author: Charlie Truong <chtruong@nvidia.com>
Date:   Sun Aug 17 21:17:58 2025 -0500

    build: Add Dockerfile that uses NGC pytorch image (NVIDIA-NeMo#897)

    Signed-off-by: Charlie Truong <chtruong@nvidia.com>

commit df31c1b
Author: pjin-nvidia <pjin@nvidia.com>
Date:   Thu Aug 14 18:34:50 2025 -0700

    feat: chunked logprob calculation with deferred fp32 cast to help with OOM (NVIDIA-NeMo#918)

    Signed-off-by: Peter Jin <pjin@nvidia.com>

commit 83c6bfc
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Thu Aug 14 21:48:55 2025 +0800

    refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 9f7825e
Author: Rayen <130129397+RayenTian@users.noreply.github.com>
Date:   Thu Aug 14 12:38:27 2025 +0800

    feat: Add TP to embed_tokens and lm_head for Gemma models (NVIDIA-NeMo#879)

    Signed-off-by: ruit <ruit@nvidia.com>

commit e1f56c4
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Tue Aug 12 13:09:37 2025 -0700

    feat: add diagnostic script for problematic embeddings (NVIDIA-NeMo#896)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 223bfa8
Author: Gerald Shen <119401249+gshennvm@users.noreply.github.com>
Date:   Mon Aug 11 18:19:52 2025 -0700

    feat: add nemotron5 sharding (NVIDIA-NeMo#481)

    Signed-off-by: Terry Kong <terryk@nvidia.com>
    Co-authored-by: Terry Kong <terryk@nvidia.com>

commit 18b9e2c
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Mon Aug 11 15:08:52 2025 -0700

    test: lower step count on gemma nightly test to finish within 4 hours (NVIDIA-NeMo#880)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 8fd8c96
Author: guyueh1 <140554423+guyueh1@users.noreply.github.com>
Date:   Mon Aug 11 10:46:29 2025 -0700

    feat: Fix and enhances for Nsight system profiling (NVIDIA-NeMo#865)

    Signed-off-by: Guyue Huang <guyueh@nvidia.com>

commit 2b87def
Author: Qidong Su <soodoshll@gmail.com>
Date:   Fri Aug 8 18:54:20 2025 -0400

    fix: OOM in deepscaler1.5b with sequence length = 16/24k  (NVIDIA-NeMo#875)

    Signed-off-by: Qidong Su <qidongs@nvidia.com>

commit fecf71e
Author: Rayen <130129397+RayenTian@users.noreply.github.com>
Date:   Sat Aug 9 06:42:07 2025 +0800

    fix: remove tie weight check (NVIDIA-NeMo#700)

    Signed-off-by: ruit <ruit@nvidia.com>

commit d45ff3f
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Fri Aug 8 10:07:02 2025 -0700

    test: add deepscaler tests + pipe-clean configs + fix eval for deepscaler (NVIDIA-NeMo#866)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit d73c942
Author: Anna Shors <ashors@nvidia.com>
Date:   Fri Aug 8 09:27:15 2025 -0700

    feat: qwen3 export to HF (NVIDIA-NeMo#873)

    Signed-off-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>
    Signed-off-by: Anna Shors <ashors@nvidia.com>
    Co-authored-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>

commit e924d33
Author: Shang Wang <samshang.wang@mail.utoronto.ca>
Date:   Fri Aug 8 12:15:34 2025 -0400

    docs: Link uv's installation instructions to uv's website (NVIDIA-NeMo#837)

    Signed-off-by: Shang Wang <samshang.wang@mail.utoronto.ca>

commit bbbb3d6
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 23:26:15 2025 +0800

    fix: fix non-colocated with cpu_offload enabled (NVIDIA-NeMo#861)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 88a399e
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 14:04:08 2025 +0800

    chore: remove old fsdp1 unit test (NVIDIA-NeMo#871)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit b8a89a9
Author: yuki <48991475+yuki-666@users.noreply.github.com>
Date:   Fri Aug 8 13:56:19 2025 +0800

    feat: support non-colocated in mcore (NVIDIA-NeMo#613)

    Signed-off-by: Yuki Huang <yukih@nvidia.com>

commit 5910abb
Author: Anna Shors <ashors@nvidia.com>
Date:   Thu Aug 7 13:11:43 2025 -0700

    feat: support DTensor CP in DPO and SFT (NVIDIA-NeMo#798)

    Signed-off-by: ashors1 <ashors@nvidia.com>

commit 0988a7d
Author: Felipe Vieira Frujeri <ffrujeri@gmail.com>
Date:   Wed Aug 6 22:01:32 2025 -0700

    fix: Fix error message in VllmGenerationWorker. (NVIDIA-NeMo#633)

    Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

commit 233cc07
Author: Parth Chadha <pchadha@nvidia.com>
Date:   Wed Aug 6 15:14:22 2025 -0700

    fix: force use of eager (disabled cuda graphs) due to convergence issues (NVIDIA-NeMo#857)

    Signed-off-by: Parth Chadha <pchadha@nvidia.com>

commit 0557402
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Wed Aug 6 14:44:29 2025 -0700

    chore: 0.3.0 -> 0.4.0rc0 (NVIDIA-NeMo#840)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 03472a0
Author: Terry Kong <terrycurtiskong@gmail.com>
Date:   Wed Aug 6 14:43:55 2025 -0700

    feat: dockerfile can build hermetically or from build context (NVIDIA-NeMo#799)

    Signed-off-by: Terry Kong <terryk@nvidia.com>

commit 9af0a52
Author: Anna Shors <ashors@nvidia.com>
Date:   Wed Aug 6 12:35:51 2025 -0700

    fix: fix grpo + mcore checkpointing without validation (NVIDIA-NeMo#844)

    Signed-off-by: ashors1 <ashors@nvidia.com>

commit b6269f7
Author: Yubo Gao <yubog@nvidia.com>
Date:   Tue Aug 5 16:55:02 2025 -0400

    feat: track policy training compute throughput (NVIDIA-NeMo#632)

    Signed-off-by: Yubo Gao <yubog@nvidia.com>

commit b74c5d0
Author: Wei Du <wedu@nvidia.com>
Date:   Tue Aug 5 15:05:13 2025 -0500

    feat: save checkpoint before timeout to avoid 4-hour runtime limit (NVIDIA-NeMo#734)

    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
    Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>

commit c784dd9
Author: Zhiyu Li <zhiyul@NVIDIA.com>
Date:   Tue Aug 5 10:47:30 2025 -0700

    feat: add data shuffle and random seed option (NVIDIA-NeMo#334)

    Signed-off-by: Zhiyu Li <zhiyul@nvidia.com>
    Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

commit c249efc
Author: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com>
Date:   Tue Aug 5 21:33:28 2025 +0400

    docs: fix checkpointing command for megatron->hf export  (NVIDIA-NeMo#823)

    Signed-off-by: abdalgader-a <abdalgader.abubaker@tii.ae>

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
)

Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>