# [megatron] support megatron expert parallel #1467
## Conversation
Please add support for the SGLang part. We will also help to evaluate.

The SGLang part is added, please help to evaluate, thanks.

@GeLee-Q is working on validation. Thanks!
Tested with Qwen3-30B on vLLM. Noticed that some losses (KL loss) diverge more from FSDP, but this does not affect the test results.

Might be related to gradient accumulation variance? Will fix with the dynamic_bsz PR.
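As a hedged illustration of how gradient-accumulation bookkeeping alone can shift a logged loss between backends (this is not code from the PR): averaging each micro-batch and then averaging those means is not the same as averaging over all tokens at once when micro-batches hold different numbers of tokens.

```python
import torch

# Three micro-batches with different token counts, as happens with uneven
# response lengths under gradient accumulation.
token_kl = [torch.rand(5), torch.rand(13), torch.rand(2)]

mean_of_means = torch.stack([t.mean() for t in token_kl]).mean()  # per-micro-batch averaging
global_mean = torch.cat(token_kl).mean()                          # token-level averaging

print(mean_of_means.item(), global_mean.item())  # generally differ
```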
### Checklist Before Starting

### What does this PR do?

Support expert parallelism in Megatron.

### High-Level Design

Introduce EP size and ETP size. ETP size is the TP size for the MoE parts; it is recommended to set it to 1, meaning the MoE parts do not use TP.

### Specific Changes

1. mcore model initialization
2. Megatron-vLLM parameter transfer

### API

### Usage Example

```bash
LLM=models/Qwen1.5-MoE-A2.7B-Chat
NODES=1
PP=2
TP=4
VLLM_TP=4
EP=4
ETP=1

python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer' \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=128 \
    data.max_prompt_length=1024 \
    data.max_response_length=512 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=$LLM \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    critic.optim.lr=1e-5 \
    critic.model.path=$LLM \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=1 \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_megatron_gsm8k_examples' \
    trainer.experiment_name='qwen_moe_instruct_1node_ep' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=$NODES \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$VLLM_TP \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$PP \
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$PP \
    critic.megatron.pipeline_model_parallel_size=$PP \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$TP \
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$TP \
    critic.megatron.tensor_model_parallel_size=$TP \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP \
    critic.megatron.expert_model_parallel_size=$EP \
    actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP \
    critic.megatron.expert_tensor_parallel_size=$ETP \
    actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
    critic.megatron.use_dist_checkpointing=True \
    actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    critic.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
    actor_rollout_ref.actor.megatron.param_offload=True \
    actor_rollout_ref.ref.megatron.param_offload=True \
    critic.megatron.param_offload=True \
    trainer.total_epochs=100 $@
```

### Test

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: Megatron
- **Inference**: vLLM, SGLang

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.

---------

Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
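For readers checking the numbers in the usage example above, here is a minimal sketch of the parallel-size arithmetic. It assumes the usual Megatron-Core convention that dense layers split each pipeline stage into TP x DP ranks while MoE layers split the same ranks into ETP x EP x expert-DP; the helper `check_moe_parallel_sizes` is a made-up name for illustration, not verl's actual validation code.

```python
# Minimal sketch under the assumed layout described above.
def check_moe_parallel_sizes(world_size: int, pp: int, tp: int, ep: int, etp: int):
    assert world_size % (pp * tp) == 0, "PP * TP must divide the world size"
    dp = world_size // (pp * tp)               # data parallelism for dense layers
    ranks_per_stage = tp * dp                  # ranks available in one pipeline stage
    assert ranks_per_stage % (ep * etp) == 0, "EP * ETP must divide TP * DP"
    expert_dp = ranks_per_stage // (ep * etp)  # data parallelism for expert layers
    return dp, expert_dp

# The 8-GPU example above: PP=2, TP=4, EP=4, ETP=1 -> dense DP=1, expert DP=1.
print(check_moe_parallel_sizes(world_size=8, pp=2, tp=4, ep=4, etp=1))
```

One likely motivation for the recommended ETP=1 is that each expert's weights then stay unsharded along the tensor dimension on their EP rank, which keeps the Megatron-to-vLLM parameter transfer simpler; this reading is an assumption, not stated in the PR.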
---

### Follow-up PR (#1638): ChainedOptimizer offloading

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This simple PR adds support for [ChainedOptimizer](https://github.com/NVIDIA/Megatron-LM/blob/75b1ca13618bded85c81fb572f58df83ba095dc9/megatron/core/optimizer/optimizer.py#L938) offloading in the Megatron-LM training environment. In Megatron-LM, ChainedOptimizer is used when expert parallelism (expert_parallel > 1, related to #1467) is enabled, commonly in Mixture-of-Experts (MoE) models. This has been tested and validated with the Qwen3-235B-A22B model configuration.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```bash
...
actor_rollout_ref.actor.megatron.optimizer_offload=True \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=16 \
...
```

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: Megatron
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.

---------

Co-authored-by: charlie.cs <charlie.cs@kakaocorp.com>
Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn>
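As a rough, assumed sketch of what ChainedOptimizer offloading involves (not the PR's actual implementation): a ChainedOptimizer wraps several sub-optimizers (e.g. one for dense parameters and one for expert parameters), so an offload path has to walk `chained_optimizers` instead of treating the optimizer as a single object. The helper names below, `move_optimizer_state` and `offload_megatron_optimizer`, are hypothetical.

```python
import torch

def move_optimizer_state(torch_optimizer: torch.optim.Optimizer, device: str) -> None:
    """Move all tensor state of a plain torch optimizer to `device`."""
    for state in torch_optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device, non_blocking=True)

def offload_megatron_optimizer(optimizer, device: str = "cpu") -> None:
    # Megatron's ChainedOptimizer holds a list of sub-optimizers; fall back to a
    # single-element list so the same path also handles non-chained optimizers.
    sub_optimizers = getattr(optimizer, "chained_optimizers", [optimizer])
    for opt in sub_optimizers:
        # Each Megatron optimizer is assumed to wrap an underlying torch optimizer
        # as `.optimizer`; fall back to the object itself otherwise.
        move_optimizer_state(getattr(opt, "optimizer", opt), device)
```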