[BugFix][CI] Megatron: add ep CI #1726
Conversation
15B is too big. I guess we should create a tiny fake model.
@vermouth1992 Currently the |
Sure, I guess we can test with a small batch size and sequence length to make sure it never causes OOM.
The CI system seems broken. Could you reopen the PR?
RuntimeError: CUDA error: out of memory
:(
15B is too large for CI. We can create a model with random weights at the start of the test and run CI on that instead.
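A minimal sketch of the "tiny fake model" idea suggested above, using only the standard library. It shrinks a config dictionary so a randomly initialized model built from it would fit comfortably in CI. The key names follow common Hugging Face conventions, but `FULL_CONFIG`, `shrink_config`, `estimate_params`, and every numeric value here are hypothetical illustrations, not taken from this PR, and the parameter count is only a rough approximation.

```python
from copy import deepcopy

# Hypothetical HF-style config for a large MoE model (illustrative values only).
FULL_CONFIG = {
    "hidden_size": 2048,
    "intermediate_size": 1408,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_experts": 64,
    "vocab_size": 151936,
}


def shrink_config(config, hidden_size=64, intermediate_size=128,
                  num_layers=2, num_heads=4, num_experts=4):
    """Return a copy of `config` with the architecture scaled down for CI.

    vocab_size is left unchanged so the original tokenizer still works.
    """
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    tiny = deepcopy(config)
    tiny.update(
        hidden_size=hidden_size,
        intermediate_size=intermediate_size,
        num_hidden_layers=num_layers,
        num_attention_heads=num_heads,
        num_experts=num_experts,
    )
    return tiny


def estimate_params(config):
    """Rough parameter count: attention + MoE experts + router + embeddings."""
    h = config["hidden_size"]
    attention = 4 * h * h  # q, k, v, o projections (assumes no grouped-query attention)
    experts = config["num_experts"] * 3 * h * config["intermediate_size"]  # gate, up, down
    router = h * config["num_experts"]
    embeddings = 2 * config["vocab_size"] * h  # input embedding + untied output head
    return config["num_hidden_layers"] * (attention + experts + router) + embeddings


tiny_config = shrink_config(FULL_CONFIG)
print(f"full: ~{estimate_params(FULL_CONFIG) / 1e9:.1f}B parameters")
print(f"tiny: ~{estimate_params(tiny_config) / 1e6:.1f}M parameters")
```

With `transformers` installed, a config like this would typically be passed to `AutoModelForCausalLM.from_config`, which builds the model with randomly initialized weights instead of loading a checkpoint. Whether that matches the approach eventually used in this repository's CI is an assumption.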
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Fix the EP bug and try to add CI with a 15B model, while looking for smaller models that are more convenient to test.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.