
Conversation

ISEEKYAN
Contributor

Support the qwen2moe structure to run with megatron-core,
including:

  • qwen2moe config converter
  • qwen2moe model initializer
  • a refactor of the online weight converter from mcore to vLLM
  • qwen2moe online weight converter
  • qwen2moe offline weight conversion script from HF to mcore
  • a script to run training of qwen1.5moe_a2.7b on 4 nodes

TODO
add an option to freeze the MoE router weights during training
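
(Not part of this PR: a minimal sketch of what the router-freeze option could look like in plain PyTorch, assuming the router parameters can be identified by the substring "router" in their names, which may not match the actual Megatron-Core parameter naming.)

```python
import torch.nn as nn

def freeze_moe_router(model: nn.Module) -> None:
    """Disable gradient updates for MoE router weights.

    Assumes router parameters contain "router" in their names; the exact
    naming depends on the Megatron-Core model definition.
    """
    for name, param in model.named_parameters():
        if "router" in name:
            param.requires_grad = False
```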

@ISEEKYAN
Contributor Author

There seems to be a bug in computing the GSM8K score with multi-node training.
I trained qwen2-7b with 1/2/4 nodes; the scores are 0.88/0.44/0.22.
I trained qwen1.5-moe-a2.7b with 4 nodes (it OOMs with fewer nodes); the model is reported to achieve 0.625 on GSM8K, but the score is about 0.159, roughly 1/4 of 0.625.
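
(The scores shrink roughly as 1/number-of-nodes, which looks like the accuracy is being averaged one extra time across ranks. For reference, a minimal sketch of node-count-independent accuracy aggregation with torch.distributed; the actual metric code in verl is not shown here and may work differently.)

```python
import torch
import torch.distributed as dist

def global_accuracy(local_correct: int, local_total: int) -> float:
    """All-reduce raw counts and divide once, so the result does not
    depend on how many ranks or nodes the eval samples were split over."""
    # With the NCCL backend the tensor must live on the current CUDA device.
    counts = torch.tensor([float(local_correct), float(local_total)],
                          device=torch.cuda.current_device())
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return (counts[0] / counts[1]).item()
```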


@ISEEKYAN ISEEKYAN marked this pull request as ready for review April 17, 2025 13:28
@vermouth1992
Collaborator

vermouth1992 commented Apr 18, 2025

Could you fix the code format? Also, we got a Qwen MoE weight loader patch yesterday.

@ISEEKYAN
Contributor Author

Could you fix the code format? Also, we got a Qwen MoE weight loader patch yesterday.

Can you share more details about your Qwen MoE weight loader patch? I will check and merge the two implementations.

@vermouth1992
Collaborator

Could you fix the code format? Also, we got a Qwen MoE weight loader patch yesterday.

Can you share more details about your Qwen MoE weight loader patch? I will check and merge the two implementations.

Here: #1137

@vermouth1992 vermouth1992 requested a review from ETOgaosion April 19, 2025 12:32
Collaborator

@ETOgaosion ETOgaosion left a comment

Great job!
Note that there seem to be multiple weight converters at the moment; we may need to unify them into a single functional unit.
Besides, the transformer config needs some refinement.
And the current dist checkpoint and converter may need some CI tests.
This PR is big enough; let's merge it first.

@ETOgaosion ETOgaosion merged commit 4fa7ed6 into volcengine:main Apr 20, 2025
17 checks passed
@ISEEKYAN
Contributor Author

Great job! Note that there seem to be multiple weight converters at the moment; we may need to unify them into a single functional unit. Besides, the transformer config needs some refinement. And the current dist checkpoint and converter may need some CI tests. This PR is big enough; let's merge it first.

About the converter:
I noticed that the weight layout and names differ across model structures, e.g. Qwen2ForCausalLM/Qwen2MoeForCausalLM/DeepseekV3ForCausalLM. So I introduced an OOP design: every structure has a converter class that inherits from a base class, making it easy to reuse common code such as the attention parts.
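
(Not the code from this PR, just a rough sketch of the class layout described above; all class and method names are illustrative only.)

```python
from abc import ABC, abstractmethod


class BaseWeightConverter(ABC):
    """Shared mcore-to-vLLM weight conversion logic."""

    def convert_attention(self, name: str, weight):
        # The attention layout is common to Qwen2/Qwen2MoE/DeepseekV3-style
        # models, so it is implemented once in the base class.
        ...

    @abstractmethod
    def convert_mlp(self, name: str, weight):
        # MLP vs. MoE layouts (router, experts, shared expert) differ per
        # architecture, so each subclass overrides this.
        ...


class Qwen2MoeConverter(BaseWeightConverter):
    def convert_mlp(self, name: str, weight):
        # Map router, expert, and shared-expert weights to the vLLM names.
        ...
```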

About the transformer config: what direction should the refinement go in?

About CI: yes, it is necessary; let's add some tests in the coming days.

):
    return init_mcore_model_dense(
        tfconfig, hf_config, pre_process, post_process, share_embeddings_and_output_weights, value
    from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec
Collaborator

As we use megatron.core in all the init functions, can we move the import to the header of this file to avoid duplicate imports?
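
(For illustration only: the suggested change amounts to hoisting the import to module scope. The function name below is a placeholder, not the actual one in this file.)

```python
# Module header: import once, reused by all init functions in the file.
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec


def init_mcore_model_example(tfconfig, use_te=True):
    # No per-function import of megatron.core needed anymore.
    return get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=use_te)
```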

Collaborator

Thanks! We will try it if it does not affect the CPU initialization process.

Contributor Author

Good suggestions; we may refine the code in follow-up PRs.

    return transformer_layer_spec

    assert tfconfig.normalization == "RMSNorm", "only RMSNorm is supported for now"
    transformer_layer_spec = get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=use_te)
Collaborator

nit: can we directly set use_transformer_engine to True instead of going through the variable use_te?

Collaborator

Thanks a lot. We may take this into consideration.

@BearBiscuit05
Collaborator

Could you fix the code format? Also, we got a Qwen MoE weight loader patch yesterday.

I have optimized the previous code in the new PR #1200.

vermouth1992 pushed a commit that referenced this pull request Apr 24, 2025
## Motivation
This is a fix for the issue where the `weight_loader` in the FusedMoE layer of
the vLLM code could not be used correctly during the resharding phase,
addressed in #923, #1137, and #1139 respectively. The results of these PRs can
now be used together, allowing both FSDP and Megatron to use the same function
and reducing code maintenance costs.
yuchenwang3 pushed a commit to yuchenwang3/verl that referenced this pull request Apr 25, 2025
yhyang201 pushed a commit to yhyang201/verl that referenced this pull request Apr 26, 2025
ScottCTD pushed a commit to ScottCTD/verl that referenced this pull request May 5, 2025