[mcore] qwen2moe support #1139
Conversation
There seems to be a bug when computing the gsm8k value during multi-node training.
Could you fix the code format? Also, we added a Qwen MoE weight loader patcher yesterday.
Can you share more details about your Qwen MoE weight loader patcher? I will check it and merge the two implementations.
Here: #1137
Great job!
Note that there seem to be multiple weight converters at the moment; we may need to unify them into a single functional unit.
Besides, the transformer config needs some refinement.
And the current dist checkpoint and converter could probably use some CI tests.
This PR is large enough as it is, so let's merge it first.
About the converter and the transformer config: what direction should the refinement take? About the CI: agreed that it is necessary, let's add some in the coming days.
):
    return init_mcore_model_dense(
        tfconfig, hf_config, pre_process, post_process, share_embeddings_and_output_weights, value

from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec
Since we use megatron.core in all the init functions, can we move this import to the top of the file to avoid duplicate imports?
Thanks! We will try it, provided it does not affect the CPU initialization process.
Good suggestion; we may refine the code in the next PRs.
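To make the trade-off concrete, here is a minimal sketch of the two import placements being discussed. The helper name and signature are illustrative and not verl's actual API; only the `get_gpt_decoder_block_spec` import comes from the diff above.

```python
# Option A (as in this PR): defer the import to the function body so that
# importing this module in a CPU-only process does not pull in megatron.core.
def build_layer_spec(tfconfig, use_te=True):  # illustrative helper, not verl's actual function
    from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec
    return get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=use_te)

# Option B (reviewer's suggestion): a single module-level import shared by all
# init functions, which avoids duplicate imports but requires megatron.core to
# be importable wherever this module is loaded (the CPU-initialization concern).
# from megatron.core.models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec
```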
return transformer_layer_spec

assert tfconfig.normalization == "RMSNorm", "only RMSNorm is supported for now"
transformer_layer_spec = get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=use_te)
nit: can we set `use_transformer_engine` to True directly instead of passing it through the `use_te` variable?
Thanks a lot. We may take this into consideration.
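For reference, the change the nit asks for would look roughly like this, assuming Transformer Engine is always desired on this code path:

```python
# Pass the flag literally instead of threading a separate use_te variable through the call.
transformer_layer_spec = get_gpt_decoder_block_spec(tfconfig, use_transformer_engine=True)
```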
I have optimized the previous code in the new PR #1200.
## Motivation
This is a fix for the issue where the `weight_loader` in vLLM's FusedMoE could not be used correctly during the resharding phase, addressed in #923, #1137, and #1139. Currently, the results of these PRs can be used together, allowing both FSDP and Megatron to use the same function and reducing code maintenance costs.
Support the qwen2moe structure running with megatron-core, including:
* qwen2moe config converter (a sketch of the config mapping follows below)
* qwen2moe model initializer
* refactor of the online weight converter from mcore to vllm
* qwen2moe online weight converter
* qwen2moe offline weight conversion script from hf to mcore
* a script to run training qwen1.5moe_a2.7b with 4 nodes

TODO
* add option to freeze the MoE router weight during training (see the freezing sketch below)
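As an illustration of what the config converter bullet covers, here is a minimal sketch of the kind of mapping such a converter performs, reading a HuggingFace `Qwen2MoeConfig` and producing Megatron-style hyperparameters. The HF field names follow transformers' Qwen2MoE config; the returned keys are illustrative, and the actual verl converter (and the exact Megatron-Core `TransformerConfig` field names, which vary by version) may differ.

```python
from transformers import AutoConfig

def hf_qwen2moe_to_mcore_kwargs(model_path: str) -> dict:
    """Sketch: map HF Qwen2MoeConfig fields to Megatron-style transformer-config kwargs.

    The returned keys are illustrative; a real converter would feed them into
    megatron.core's TransformerConfig, whose exact field names depend on the version.
    """
    hf = AutoConfig.from_pretrained(model_path)
    return dict(
        num_layers=hf.num_hidden_layers,
        hidden_size=hf.hidden_size,
        num_attention_heads=hf.num_attention_heads,
        num_query_groups=hf.num_key_value_heads,       # GQA groups
        ffn_hidden_size=hf.intermediate_size,          # dense-MLP width
        moe_ffn_hidden_size=hf.moe_intermediate_size,  # per-expert MLP width
        num_moe_experts=hf.num_experts,                # number of routed experts
        moe_router_topk=hf.num_experts_per_tok,        # experts activated per token
        normalization="RMSNorm",
        layernorm_epsilon=hf.rms_norm_eps,
    )

# e.g. kwargs = hf_qwen2moe_to_mcore_kwargs("Qwen/Qwen1.5-MoE-A2.7B")
```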
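The TODO about freezing the MoE router weight is not implemented in this PR. A straightforward way to do it in plain PyTorch is to disable gradients for the router parameters by name, as in the sketch below; the `"router"` name pattern is an assumption and should be checked against the parameter names of the Megatron-Core MoE implementation actually in use.

```python
import torch.nn as nn

def freeze_moe_router(model: nn.Module, name_pattern: str = "router") -> int:
    """Sketch: disable gradients for MoE router parameters matched by substring.

    "router" matches the router parameters in recent Megatron-Core MoE layers,
    but the pattern is an assumption; verify it against model.named_parameters()
    before relying on it.
    """
    frozen = 0
    for name, param in model.named_parameters():
        if name_pattern in name:
            param.requires_grad = False
            frozen += 1
    return frozen
```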