Add CI test for local checkpointing #13012

Merged: 21 commits, May 6, 2025

ananthsub (Collaborator)

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Adds a single-node, 2-GPU test for local checkpointing support.

Collection: llm

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
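
As a rough illustration of the re-run step described above, the sketch below builds the two GitHub CLI (`gh pr edit`) invocations that remove and then re-add the label. The helper name `rerun_ci_commands` is hypothetical, and the `NVIDIA/NeMo` repository slug is an assumption inferred from the merge target shown on this page; the snippet only prints the commands and does not run them.

```python
import shlex


def rerun_ci_commands(pr_number, label="Run CICD", repo="NVIDIA/NeMo"):
    """Return the two `gh` argument lists that remove and re-add a PR label.

    `repo` is an assumed slug; adjust it for a fork. Nothing is executed here.
    """
    base = ["gh", "pr", "edit", str(pr_number), "--repo", repo]
    return [
        base + ["--remove-label", label],  # step 1: drop the label
        base + ["--add-label", label],     # step 2: add it back to trigger CI
    ]


if __name__ == "__main__":
    # Print shell-quoted commands so they can be reviewed before running.
    for argv in rerun_ci_commands(13012):
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch safe to run without credentials; someone with write access could paste the output into a shell once it looks right.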

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the CI label Apr 14, 2025
@ananthsub ananthsub changed the title Local ckpt test Add CI test for local checkpointing Apr 14, 2025
@ananthsub ananthsub added the Run CICD and r2.3.0 (Pick this label for auto-cherrypicking into v2.3.0) labels Apr 23, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Apr 23, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Apr 23, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Apr 29, 2025
Signed-off-by: ananthsub <ananthsub@users.noreply.github.com>
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Apr 29, 2025
maanug-nv previously approved these changes Apr 29, 2025
Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
@ko3n1g ko3n1g (Collaborator) left a comment:

good from automation perspective, let me know when to merge

@ananthsub (Collaborator, Author)

@ko3n1g this is good to merge

@ko3n1g ko3n1g merged commit f2db26d into NVIDIA:main May 6, 2025
204 checks passed
ananthsub added a commit to ananthsub/NeMo that referenced this pull request May 6, 2025
* Add end to end test for local checkpoint

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lint

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* Update tests/functional_tests/L2_NeMo_2_llama3_local_ckpt.sh

Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* address comments from review

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update test script

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update test

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lint

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* clean up assertions in log

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* combine converage files

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* use nemo run for test launch

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* rebase after new CI workflow change

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ananthsub <ananthsub@users.noreply.github.com>

* Update .github/workflows/cicd-main-nemo2.yml

Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
Signed-off-by: ananthsub <ananthsub@users.noreply.github.com>
Co-authored-by: ananthsub <ananthsub@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
chtruong814 pushed a commit that referenced this pull request May 16, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
ananthsub added a commit to ananthsub/NeMo that referenced this pull request May 16, 2025
chtruong814 added a commit that referenced this pull request May 16, 2025
….0` (#13472)

* Add CI test for local checkpointing (#13012)

* Add end to end test for local checkpoint

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update tests

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lint

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lints

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* Update tests/functional_tests/L2_NeMo_2_llama3_local_ckpt.sh

Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* address comments from review

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update test script

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* update test

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* lint

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* clean up assertions in log

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* combine converage files

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* use nemo run for test launch

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* rebase after new CI workflow change

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ananthsub <ananthsub@users.noreply.github.com>

* Update .github/workflows/cicd-main-nemo2.yml

Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
Signed-off-by: ananthsub <ananthsub@users.noreply.github.com>
Co-authored-by: ananthsub <ananthsub@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>

* Add local checkpoint test to CI file

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

---------

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Ananth Subramaniam <ananth.subramaniam@gmail.com>
Signed-off-by: ananthsub <ananthsub@users.noreply.github.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: ananthsub <ananthsub@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Labels: CI, r2.3.0 (Pick this label for auto-cherrypicking into v2.3.0), Run CICD
4 participants