Conversation

ETOgaosion (Collaborator)

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Fix an EP (expert parallelism) bug and try to add a CI test using a 15B MoE model, while looking for smaller models that are more convenient to test.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

vermouth1992 (Collaborator)

15B is too big. I guess we should create a tiny fake model.

@ETOgaosion ETOgaosion marked this pull request as draft May 27, 2025 14:42
@ETOgaosion ETOgaosion marked this pull request as ready for review May 27, 2025 14:45
ETOgaosion (Collaborator, Author)

@vermouth1992 The checkpoint_converter test already uses the 15B Qwen/Qwen1.5-MoE-A2.7B-Chat model, so the CI machines do not need to download it again; we can directly run one training step to check whether it works well.

vermouth1992 (Collaborator)

Sure, I guess we can test with a small batch size and sequence length to ensure it never causes OOM.
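As a rough sketch of what such a CI invocation might look like: this assumes verl's Hydra-style command-line overrides, and the specific script path, key names, and values below are illustrative assumptions, not taken from this PR.

```shell
# Hypothetical CI smoke test: shrink batch size and sequence lengths so a
# single training step of the MoE model fits in CI GPU memory.
# The override keys and values here are assumptions, not from this PR.
python3 -m verl.trainer.main_ppo \
    data.train_batch_size=8 \
    data.max_prompt_length=128 \
    data.max_response_length=128 \
    trainer.total_training_steps=1
```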

vermouth1992 (Collaborator)

The CI system seems broken, could you reopen the PR?

eric-haibin-lin (Collaborator) left a comment

RuntimeError: CUDA error: out of memory
:(

vermouth1992 (Collaborator)

15B is too large to run in CI. We can create a model with random weights at the beginning of the test instead.

@vermouth1992 vermouth1992 merged commit 2a386cf into volcengine:main Jun 5, 2025
34 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025