Conversation

@ISEEKYAN (Contributor) commented May 9, 2025

Checklist Before Starting

What does this PR do?

Support expert parallelism in Megatron.

High-Level Design

Introduce EP size (expert model parallel size) and ETP size (expert tensor parallel size).
ETP size is the TP size used for the MoE parts; it is recommended to set it to 1, meaning the MoE parts do not use TP.
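For example (assuming Megatron-Core's decoupled expert parallelism), with EP size = 4 and ETP size = 1, the experts of each MoE layer are sharded across 4 expert-parallel ranks, while each individual expert's weights stay whole on a single rank instead of being tensor-sliced.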

Specific Changes

  1. mcore model initialization
  2. Megatron-to-vLLM parameter transfer (see the sketch below)
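
To illustrate change 2, the sketch below shows the core idea of resharding MoE weights for the rollout engine; it is a conceptual example only (the helper name and tensor layout are assumptions, not this PR's code). Each EP rank holds only a slice of the experts, so the slices have to be gathered over the expert-parallel group before they can be loaded into an inference engine that is not expert-parallel.

```python
# Conceptual sketch only: the helper below is hypothetical, not verl's API.
import torch
import torch.distributed as dist


def gather_expert_weights(local_experts: torch.Tensor, ep_group: dist.ProcessGroup) -> torch.Tensor:
    """Gather the [num_local_experts, ...] shard from every EP rank and
    concatenate them into the full [num_experts, ...] tensor for the rollout engine."""
    ep_size = dist.get_world_size(group=ep_group)
    shards = [torch.empty_like(local_experts) for _ in range(ep_size)]
    dist.all_gather(shards, local_experts.contiguous(), group=ep_group)
    return torch.cat(shards, dim=0)
```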

API

Usage Example

LLM=models/Qwen1.5-MoE-A2.7B-Chat
NODES=1
PP=2
TP=4
VLLM_TP=4
EP=4
ETP=1

python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer' \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=128 \
    data.max_prompt_length=1024 \
    data.max_response_length=512 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=$LLM \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    critic.optim.lr=1e-5 \
    critic.model.path=$LLM \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=1 \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_megatron_gsm8k_examples' \
    trainer.experiment_name='qwen_moe_instruct_1node_ep' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=$NODES \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$VLLM_TP \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$PP \
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$PP \
    critic.megatron.pipeline_model_parallel_size=$PP \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$TP \
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$TP \
    critic.megatron.tensor_model_parallel_size=$TP \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP \
    critic.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP \
    critic.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
    critic.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    critic.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    actor_rollout_ref.actor.megatron.param_offload=True \
    actor_rollout_ref.ref.megatron.param_offload=True \
    critic.megatron.param_offload=True \
    trainer.total_epochs=100 "$@"
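
As a sanity check on the sizes above (assuming the MoE layers repartition the same tensor/data-parallel ranks, as in Megatron-Core's decoupled expert parallelism): with 8 GPUs and PP=2, each pipeline stage spans 4 GPUs; TP=4 leaves DP=1, and for the MoE layers ETP=1 with EP=4 reuses those 4 GPUs as 4 expert-parallel ranks, i.e. expert data parallel = (TP x DP) / (ETP x EP) = 4 / 4 = 1.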

Test

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: Megatron
  • Inference: vLLM, SGLang

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@ccclyu self-requested a review May 10, 2025 08:13
@zhaochenyang20 (Collaborator)

Please add support for the SGLang part. We will also help to evaluate.

@ISEEKYAN (Contributor, Author)

> Please add support for the SGLang part. We will also help to evaluate.

The SGLang part is added; please help to evaluate, thanks.

@zhaochenyang20 (Collaborator)

> Please add support for the SGLang part. We will also help to evaluate.
>
> The SGLang part is added; please help to evaluate, thanks.

@GeLee-Q is working on validation. Thanks!

@ETOgaosion (Collaborator) left a comment

Tested with Qwen3-30B on vLLM; noticed that some losses (e.g. the KL loss) diverge more from the FSDP results, but this does not affect the test results.

Might be related to the gradient_accumulation variable? Will fix with the dynamic_bsz PR.

@ETOgaosion merged commit add17f0 into volcengine:main May 21, 2025
35 checks passed
cedricbeta pushed a commit to cedricbeta/verl that referenced this pull request May 21, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request May 22, 2025
ETOgaosion added a commit that referenced this pull request May 24, 2025
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This simple PR adds support for
[ChainedOptimizer](https://github.com/NVIDIA/Megatron-LM/blob/75b1ca13618bded85c81fb572f58df83ba095dc9/megatron/core/optimizer/optimizer.py#L938)
offloading in the Megatron-LM training environment.

In Megatron-LM, ChainedOptimizer is used when expert parallelism
(expert_parallel > 1, related to #1467) is enabled, commonly in
Mixture-of-Experts (MoE) models.

This has been tested and validated with the Qwen3-235B-A22B model
configuration.


### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
...
actor_rollout_ref.actor.megatron.optimizer_offload=True \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=16 \
...
```
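
As a rough illustration of what the offload has to do (a sketch under assumed attribute names, not the implementation in this PR): ChainedOptimizer wraps a list of inner optimizers (e.g. one for dense parameters and one for expert parameters), so offloading walks each inner optimizer and moves its state tensors between GPU and CPU.

```python
# Illustrative sketch only; the `chained_optimizers` / `.optimizer.state` layout is assumed.
import torch


def offload_chained_optimizer_state(chained_optimizer, device: str = "cpu") -> None:
    """Move every inner optimizer's state tensors to `device`
    ("cpu" to offload, the original CUDA device to reload)."""
    for inner in chained_optimizer.chained_optimizers:
        for param_state in inner.optimizer.state.values():
            for key, value in param_state.items():
                if torch.is_tensor(value):
                    param_state[key] = value.to(device)
```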

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Megatron]
- **Inference**: [none]

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.

---------

Co-authored-by: charlie.cs <charlie.cs@kakaocorp.com>
Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn>
ETOgaosion added a commit to Jianbing-D/verl that referenced this pull request Jun 8, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025