-
Notifications
You must be signed in to change notification settings - Fork 122
feat: save checkpoint before timeout to avoid 4-hour runtime limit #734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: save checkpoint before timeout to avoid 4-hour runtime limit #734
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for contributing. this would definitely be a valuable feature to have. I've left some comments
@terrykong I revised based on your suggestions and let me know if have more comments |
@wedu-nvidia could you address the DCO failure and run the pre-commit hooks. See https://github.com/NVIDIA-NeMo/RL/blob/main/CONTRIBUTING.md |
d106c44
to
b3c7f82
Compare
b3c7f82
to
33597d1
Compare
33597d1
to
b3c7f82
Compare
9cfa5a7
to
b242f32
Compare
…g time lmit Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
831926d
to
e51cad2
Compare
Signed-off-by: Wei Du <wedu@nvidia.com>
e51cad2
to
421d34c
Compare
Hi @terrykong, all DCO and pre-commit issues have been resolved, and the commits are now properly signed. Please help approve the pending workflows and review the change request when convenient — thanks! |
@wedu-nvidia looks like there are still some failures, this time with pyrefly
|
@terrykong May I know what is the status for this now? |
Signed-off-by: Wei Du <wedu@nvidia.com>
@terrykong I added another parameter, and hope it can pass all. |
Signed-off-by: Wei Du <wedu@nvidia.com>
Head branch was pushed to by a user without write access
@terrykong The previous error seems solved and I saw another error and I added in
|
Signed-off-by: Wei Du <wedu@nvidia.com>
@terrykong Can you help add it the merge queue again? Thanks so much |
@terrykong can you put it into mergequeue? |
@terrykong Why I did not see the conflict? |
…VIDIA-NeMo#734) Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>
commit b246e55 Author: Youngeun Kwon <youngeunk@nvidia.com> Date: Mon Aug 25 15:05:48 2025 -0700 update the script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> commit 5315a6b Author: Youngeun Kwon <youngeunk@nvidia.com> Date: Mon Aug 25 13:59:16 2025 -0700 script update Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> commit 4437402 Author: Youngeun Kwon <youngeunk@nvidia.com> Date: Tue Jul 15 17:42:23 2025 -0700 local Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> wip Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> add script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> update script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> update script Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> interactive Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com> commit b721703 Author: Charlie Truong <chtruong@nvidia.com> Date: Mon Aug 18 11:22:54 2025 -0500 build: Fix pytorch image ref in Dockerfile.ngc_pytorch (NVIDIA-NeMo#936) Signed-off-by: Charlie Truong <chtruong@nvidia.com> commit 70b9666 Author: Charlie Truong <chtruong@nvidia.com> Date: Sun Aug 17 21:17:58 2025 -0500 build: Add Dockerfile that uses NGC pytorch image (NVIDIA-NeMo#897) Signed-off-by: Charlie Truong <chtruong@nvidia.com> commit df31c1b Author: pjin-nvidia <pjin@nvidia.com> Date: Thu Aug 14 18:34:50 2025 -0700 feat: chunked logprob calculation with deferred fp32 cast to help with OOM (NVIDIA-NeMo#918) Signed-off-by: Peter Jin <pjin@nvidia.com> commit 83c6bfc Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Thu Aug 14 21:48:55 2025 +0800 refactor: split sync/async vllm worker ([1/2] of refactor vllm worker) (NVIDIA-NeMo#900) Signed-off-by: Yuki Huang <yukih@nvidia.com> commit 9f7825e Author: Rayen <130129397+RayenTian@users.noreply.github.com> Date: Thu Aug 14 12:38:27 2025 +0800 feat: Add TP to embed_tokens and lm_head for Gemma models (NVIDIA-NeMo#879) Signed-off-by: ruit <ruit@nvidia.com> commit e1f56c4 Author: Terry Kong <terrycurtiskong@gmail.com> Date: Tue Aug 12 13:09:37 2025 -0700 feat: add diagnostic script for problematic embeddings (NVIDIA-NeMo#896) Signed-off-by: Terry Kong <terryk@nvidia.com> commit 223bfa8 Author: Gerald Shen <119401249+gshennvm@users.noreply.github.com> Date: Mon Aug 11 18:19:52 2025 -0700 feat: add nemotron5 sharding (NVIDIA-NeMo#481) Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Terry Kong <terryk@nvidia.com> commit 18b9e2c Author: Terry Kong <terrycurtiskong@gmail.com> Date: Mon Aug 11 15:08:52 2025 -0700 test: lower step count on gemma nightly test to finish within 4 hours (NVIDIA-NeMo#880) Signed-off-by: Terry Kong <terryk@nvidia.com> commit 8fd8c96 Author: guyueh1 <140554423+guyueh1@users.noreply.github.com> Date: Mon Aug 11 10:46:29 2025 -0700 feat: Fix and enhances for Nsight system profiling (NVIDIA-NeMo#865) Signed-off-by: Guyue Huang <guyueh@nvidia.com> commit 2b87def Author: Qidong Su <soodoshll@gmail.com> Date: Fri Aug 8 18:54:20 2025 -0400 fix: OOM in deepscaler1.5b with sequence length = 16/24k (NVIDIA-NeMo#875) Signed-off-by: Qidong Su <qidongs@nvidia.com> commit fecf71e Author: Rayen <130129397+RayenTian@users.noreply.github.com> Date: Sat Aug 9 06:42:07 2025 +0800 fix: remove tie weight check (NVIDIA-NeMo#700) Signed-off-by: ruit <ruit@nvidia.com> commit d45ff3f Author: Terry Kong <terrycurtiskong@gmail.com> Date: Fri Aug 8 10:07:02 2025 -0700 test: add deepscaler tests + pipe-clean configs + fix eval for deepscaler (NVIDIA-NeMo#866) Signed-off-by: Terry Kong <terryk@nvidia.com> commit d73c942 Author: Anna Shors <ashors@nvidia.com> Date: Fri Aug 8 09:27:15 2025 -0700 feat: qwen3 export to HF (NVIDIA-NeMo#873) Signed-off-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com> Signed-off-by: Anna Shors <ashors@nvidia.com> Co-authored-by: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com> commit e924d33 Author: Shang Wang <samshang.wang@mail.utoronto.ca> Date: Fri Aug 8 12:15:34 2025 -0400 docs: Link uv's installation instructions to uv's website (NVIDIA-NeMo#837) Signed-off-by: Shang Wang <samshang.wang@mail.utoronto.ca> commit bbbb3d6 Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Fri Aug 8 23:26:15 2025 +0800 fix: fix non-colocated with cpu_offload enabled (NVIDIA-NeMo#861) Signed-off-by: Yuki Huang <yukih@nvidia.com> commit 88a399e Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Fri Aug 8 14:04:08 2025 +0800 chore: remove old fsdp1 unit test (NVIDIA-NeMo#871) Signed-off-by: Yuki Huang <yukih@nvidia.com> commit b8a89a9 Author: yuki <48991475+yuki-666@users.noreply.github.com> Date: Fri Aug 8 13:56:19 2025 +0800 feat: support non-colocated in mcore (NVIDIA-NeMo#613) Signed-off-by: Yuki Huang <yukih@nvidia.com> commit 5910abb Author: Anna Shors <ashors@nvidia.com> Date: Thu Aug 7 13:11:43 2025 -0700 feat: support DTensor CP in DPO and SFT (NVIDIA-NeMo#798) Signed-off-by: ashors1 <ashors@nvidia.com> commit 0988a7d Author: Felipe Vieira Frujeri <ffrujeri@gmail.com> Date: Wed Aug 6 22:01:32 2025 -0700 fix: Fix error message in VllmGenerationWorker. (NVIDIA-NeMo#633) Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com> commit 233cc07 Author: Parth Chadha <pchadha@nvidia.com> Date: Wed Aug 6 15:14:22 2025 -0700 fix: force use of eager (disabled cuda graphs) due to convergence issues (NVIDIA-NeMo#857) Signed-off-by: Parth Chadha <pchadha@nvidia.com> commit 0557402 Author: Terry Kong <terrycurtiskong@gmail.com> Date: Wed Aug 6 14:44:29 2025 -0700 chore: 0.3.0 -> 0.4.0rc0 (NVIDIA-NeMo#840) Signed-off-by: Terry Kong <terryk@nvidia.com> commit 03472a0 Author: Terry Kong <terrycurtiskong@gmail.com> Date: Wed Aug 6 14:43:55 2025 -0700 feat: dockerfile can build hermetically or from build context (NVIDIA-NeMo#799) Signed-off-by: Terry Kong <terryk@nvidia.com> commit 9af0a52 Author: Anna Shors <ashors@nvidia.com> Date: Wed Aug 6 12:35:51 2025 -0700 fix: fix grpo + mcore checkpointing without validation (NVIDIA-NeMo#844) Signed-off-by: ashors1 <ashors@nvidia.com> commit b6269f7 Author: Yubo Gao <yubog@nvidia.com> Date: Tue Aug 5 16:55:02 2025 -0400 feat: track policy training compute throughput (NVIDIA-NeMo#632) Signed-off-by: Yubo Gao <yubog@nvidia.com> commit b74c5d0 Author: Wei Du <wedu@nvidia.com> Date: Tue Aug 5 15:05:13 2025 -0500 feat: save checkpoint before timeout to avoid 4-hour runtime limit (NVIDIA-NeMo#734) Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> commit c784dd9 Author: Zhiyu Li <zhiyul@NVIDIA.com> Date: Tue Aug 5 10:47:30 2025 -0700 feat: add data shuffle and random seed option (NVIDIA-NeMo#334) Signed-off-by: Zhiyu Li <zhiyul@nvidia.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com> commit c249efc Author: Abdalgader Abubaker <136640907+abdalgader-a@users.noreply.github.com> Date: Tue Aug 5 21:33:28 2025 +0400 docs: fix checkpointing command for megatron->hf export (NVIDIA-NeMo#823) Signed-off-by: abdalgader-a <abdalgader.abubaker@tii.ae> Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
…g time lmit
What does this PR do ?
Since the server automatically stops after 4 hours, it's recommended to save a checkpoint beforehand. For example, set the timeout to 3 hours and 45 minutes to ensure check point saved is saved in time
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information