[Model]Add Tencent HunYuanMoEV1 Model Support #20114
Conversation
Summary of Changes
Hello @aiyiwang2025, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive support for Tencent's HunYuan V1 models, encompassing both dense and Mixture-of-Experts (MoE) variants, into the vLLM framework. This integration enables efficient inference for these new models by incorporating their specific architectural components, including specialized rotary embeddings and advanced attention mechanisms, and by registering them within the vLLM ecosystem.
Highlights
- New Model Definition: A new module, `hunyuan_v1.py`, has been added, defining the core components for Tencent's HunYuan models. This includes their specific MLP, attention mechanisms (supporting self- and cross-attention), decoder layers, and the overall model structure, enabling vLLM to support these new architectures.
- Mixture-of-Experts (MoE) Support: The pull request introduces the `HunYuanSparseMoeBlock` to specifically handle the Mixture-of-Experts (MoE) architecture found in HunYuanMoEV1. This includes logic for expert routing and shared MLP components, crucial for efficient inference of MoE models.
- Dynamic NTK Alpha Rotary Embedding: A new rotary embedding class, `DynamicNTKAlphaRotaryEmbedding`, has been implemented and integrated into the `get_rope` function. This addition supports HunYuan's unique dynamic NTK alpha scaling for positional embeddings, ensuring accurate model behavior (a minimal sketch follows this list).
- Cross-Layer Attention (CLA) Handling: The HunYuan model architecture now includes specific logic to manage and utilize `prev_kv_states` for Cross-Layer Attention (CLA), a feature potentially used in HunYuan models to enhance performance or context understanding.
- Model Registration: Both `HunYuanDenseV1ForCausalLM` and `HunYuanMoEV1ForCausalLM` models are now registered within the vLLM model registry, making them discoverable and loadable by the framework for inference.
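As a rough orientation for reviewers, here is a minimal sketch of the dynamic NTK alpha base scaling used when building the cos/sin cache. It is written in plain PyTorch as standalone functions rather than the PR's DynamicNTKAlphaRotaryEmbedding class; the scaling formula mirrors the one in the PR, while the function names and standalone structure are illustrative assumptions.

import torch

def ntk_alpha_inv_freq(rotary_dim: int, base: float, scaling_alpha: float) -> torch.Tensor:
    # Scale the RoPE base by alpha ** (dim / (dim - 2)), as in the PR's
    # _compute_cos_sin_cache, then derive the usual inverse frequencies.
    scaled_base = base * scaling_alpha ** (rotary_dim / (rotary_dim - 2))
    exponents = torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim
    return 1.0 / (scaled_base ** exponents)

def ntk_alpha_cos_sin_cache(max_position_embeddings: int, rotary_dim: int,
                            base: float, scaling_alpha: float) -> torch.Tensor:
    inv_freq = ntk_alpha_inv_freq(rotary_dim, base, scaling_alpha)
    t = torch.arange(max_position_embeddings, dtype=torch.float32)
    freqs = torch.einsum("i,j->ij", t, inv_freq)  # [max_len, rotary_dim // 2]
    return torch.cat((freqs.cos(), freqs.sin()), dim=-1)

# Example: with alpha > 1 the effective base grows, lowering frequencies for long context.
cache = ntk_alpha_cos_sin_cache(max_position_embeddings=4096, rotary_dim=128,
                                base=10000.0, scaling_alpha=1000.0)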
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The pull request introduces support for Tencent's HunYuanMoEV1 model within the vLLM framework. It includes the addition of a new rotary embedding and a model definition file, along with modifications to the model registry. The changes appear to be well-structured and validated, but there are a few opportunities to improve code clarity and remove redundant code.
def _compute_cos_sin_cache(self) -> torch.Tensor:
    # For Hunyuan DynamicNTKAlphaRotaryEmbedding
    max_len = self.max_position_embeddings
    base = self.base * self.scaling_alpha ** (self.rotary_dim / (self.rotary_dim-2))
Consider adding parentheses to improve readability and explicitly define the order of operations in this calculation. It's good practice to make the intent clear, especially when dealing with multiple operators.
Style Guide References
base = self.base * self.scaling_alpha ** (self.rotary_dim / (self.rotary_dim-2))
base = self.base * (self.scaling_alpha ** (self.rotary_dim / (self.rotary_dim-2)))
Footnotes
1. While there is no explicit style guide provided, using parentheses for clarity in complex calculations is a common practice to improve code readability.
k_tmp = torch.empty_like(k)  # TODO: redundant rotary embedding
q, _ = self.rotary_emb(positions, q, k_tmp)
This rotary embedding is marked as redundant. It should be removed to avoid unnecessary computation and memory allocation. If it's truly needed in some cases, clarify the conditions under which it's used; otherwise, remove it entirely. Leaving dead code can lead to confusion and maintenance issues.
k_tmp = torch.empty_like(k)  # TODO: redundant rotary embedding
q, _ = self.rotary_emb(positions, q, k_tmp)
Style Guide References
Footnotes
1. While there is no explicit style guide provided, avoiding dead code is a common practice to improve code maintainability.
Thanks for implementing your model in vLLM! Can you add it to the list of supported models and update the tests as well?
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run the full CI to test the changes comprehensively before merging.
Just added some initial comments after a quick glance; I'll take a deeper look later. PTAL!
bitsandbytes_stacked_params_mapping = {
    # shard_name, weight_name, index
    "q_proj": ("qkv_proj", 0),
    "k_proj": ("qkv_proj", 1),
    "v_proj": ("qkv_proj", 2),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}
bitsandbytes_stacked_params_mapping = {
    # shard_name, weight_name, index
    "q_proj": ("qkv_proj", 0),
    "k_proj": ("qkv_proj", 1),
    "v_proj": ("qkv_proj", 2),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}
This field isn't used anymore.
Done, removed the unused parts.
class HunYuanAttention(nn.Module):
I suggest decoupling the self-attn and cross-attn implementations into HunYuanAttention and HunYuanCrossAttention respectively.
Currently, HunYuanMoEV1ForCausalLM does not use cross-attn but self-attn, so we keep HunYuanAttention.
Although cross-attn is unused, I think decoupling it would improve readability; otherwise this attention layer implementation is a bit too long to read.
Done. Decoupled the self-attn and cross-attn implementations into HunYuanAttention and HunYuanCrossAttention respectively.
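For readers following this thread, here is a simplified, hypothetical sketch (plain PyTorch, not the PR's actual HunYuanAttention/HunYuanCrossAttention classes) of what the decoupling looks like conceptually: the self-attention module projects Q, K, and V and returns its K/V for reuse, while the cross-layer-attention module projects only Q and consumes prev_kv_states from an earlier layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionSketch(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor):
        b, s, _ = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        q = q.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Hand K/V back so a later CLA layer can reuse them as prev_kv_states.
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1)), (k, v)

class CrossLayerAttentionSketch(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor, prev_kv_states):
        b, s, _ = x.shape
        k, v = prev_kv_states  # K/V produced by an earlier layer
        q = self.q_proj(x).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))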
docs/models/supported_models.md
Outdated
@@ -388,6 +388,7 @@ Specified using `--task generate`.
| `MiniMaxM1ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-M1-40k`, `MiniMaxAI/MiniMax-M1-80k`, etc. | | | | |
| `MiniMaxText01ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-Text-01`, etc. | | | | |
| `Zamba2ForCausalLM` | Zamba2 | `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc. | | | | |
| `HunYuanMoEV1ForCausalLM` | Hunyuan-80B-A13B | `tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc. | | | ✅︎ |
I think we haven't supported cross attention in v1 yet, does this model work with v1?
Because it uses self-attn, it currently supports v1 and has been verified.
Can you also update the dense model in the document? And it seems PP should be supported too.
@@ -0,0 +1,851 @@
# coding=utf-8
# coding=utf-8
Suggested:
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# coding=utf-8
Done.
"embed_tokens": "input_embeddings", | ||
"lm_head": "output_embeddings", | ||
} | ||
embedding_padding_modules = ["lm_head"] |
There is no need to explicitly set embedding_modules and embedding_padding_modules.
Done, removed the unused parts.
Force-pushed from 864f1bc to 06c60c3.
"HunYuanDenseV1ForCausalLM": ("hunyuan_v1", "HunYuanV1ForCausalLM"), | ||
"HunYuanMoEV1ForCausalLM": ("hunyuan_v1", "HunYuanV1ForCausalLM"), |
I think we had better decouple the dense model and MoE model implementations, because the MoE model has more complicated weight-loading logic, and the coupling makes maintenance difficult.
Done. Decoupled the dense model and MoE model implementations.
class HunYuanAttention(nn.Module):
Although cross-attn is unused, I think decoupling it would improve readability; otherwise this attention layer implementation is a bit too long to read.
Co-authored-by: quinnrong <quinnrong@tencent.com> Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Force-pushed from 8bf8ea6 to a8ce7ea.
An exception is thrown when loading the Hunyuan-A13B-Instruct-FP8 model. @aiyiwang2025
I also had the error that @xjpang got. Patched it by adding a prefix to the experts and gate modules (see git diff).
Got it. Thanks.
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
@xjpang In fact, I did not encounter the above problems. You can try the corresponding solution from @intervitens. I have also synchronized the corresponding code into this PR.
@aiyiwang2025 Can you provide the reasoning and function calling parsers?
@xjpang The reasoning parser is currently under development; for the tool call parser you can refer to this.
return final_hidden_states.view(orig_shape)

class HunYuanAttention(nn.Module):
I think we can reuse the attention and MLP implementations from hunyuan_v1_dense.py.
Done. Removed the duplicate logic in hunyuan_v1_moe.py.
self.attn = Attention(
    self.num_heads,
    self.head_dim,
    self.scaling,
    num_kv_heads=self.num_kv_heads,
    cache_config=cache_config,
    quant_config=quant_config,
    prefix=f"{prefix}.attn",
)
self.attn = Attention(
    self.num_heads,
    self.head_dim,
    self.scaling,
    num_kv_heads=self.num_kv_heads,
    cache_config=cache_config,
    quant_config=quant_config,
    prefix=f"{prefix}.attn",
)
Suggested:
self.attn = Attention(
    self.num_heads,
    self.head_dim,
    self.scaling,
    num_kv_heads=self.num_kv_heads,
    cache_config=cache_config,
    quant_config=quant_config,
    prefix=f"{prefix}.attn",
    attn_type=AttentionType.ENCODER_DECODER,
)
We need to pass AttentionType.ENCODER_DECODER for cross-attn.
Done.
attention_type = ("cross" if layer_id >= 0
                  and layer_id % cla_factor != 0 else "self")
attention_type = ("cross" if layer_id >= 0
                  and layer_id % cla_factor != 0 else "self")
Suggested:
attention_type = (AttentionType.ENCODER_DECODER if layer_id >= 0
                  and layer_id % cla_factor != 0 else AttentionType.DECODER)
We can use vLLM's AttentionType enum here (vllm/attention/backends/abstract.py, lines 20 to 32 at 7b1895e):
class AttentionType:
    """
    Attention type.
    Use string to be compatible with `torch.compile`.
    """
    # Decoder attention between previous layer Q/K/V
    DECODER = "decoder"
    # Encoder attention between previous layer Q/K/V for encoder-decoder
    ENCODER = "encoder"
    # Encoder attention between previous layer Q/K/V
    ENCODER_ONLY = "encoder_only"
    # Attention between dec. Q and enc. K/V for encoder-decoder
    ENCODER_DECODER = "encoder_decoder"
Done.
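To make the cla_factor logic above concrete, here is a small illustrative helper — a hypothetical standalone function, not code from the PR, assuming only the AttentionType enum shown above from the vLLM tree. With cla_factor=2, even layers do standard decoder self-attention and odd layers reuse the previous layer's K/V, which vLLM expresses as ENCODER_DECODER.

from vllm.attention.backends.abstract import AttentionType

def pick_attention_type(layer_id: int, cla_factor: int) -> str:
    # Layers whose index is not a multiple of cla_factor reuse K/V from the
    # previous layer (cross-layer attention); the rest compute fresh K/V.
    if layer_id >= 0 and layer_id % cla_factor != 0:
        return AttentionType.ENCODER_DECODER
    return AttentionType.DECODER

print([pick_attention_type(i, cla_factor=2) for i in range(6)])
# ['decoder', 'encoder_decoder', 'decoder', 'encoder_decoder', 'decoder', 'encoder_decoder']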
if "mlp.experts" in name: | ||
continue |
if "mlp.experts" in name: | |
continue |
Seems unnecessary for dense model.
Done.
docs/models/supported_models.md
Outdated
@@ -388,6 +388,7 @@ Specified using `--task generate`.
| `MiniMaxM1ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-M1-40k`, `MiniMaxAI/MiniMax-M1-80k`, etc. | | | | |
| `MiniMaxText01ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-Text-01`, etc. | | | | |
| `Zamba2ForCausalLM` | Zamba2 | `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc. | | | | |
| `HunYuanMoEV1ForCausalLM` | Hunyuan-80B-A13B | `tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc. | | | ✅︎ |
Can you also update the dense model in the document? And it seems PP should be supported too.
tests/models/registry.py
Outdated
@@ -259,6 +259,7 @@ def check_available_online(
    "Zamba2ForCausalLM": _HfExamplesInfo("Zyphra/Zamba2-7B-instruct"),
    "MiMoForCausalLM": _HfExamplesInfo("XiaomiMiMo/MiMo-7B-RL",
                                       trust_remote_code=True),
    "HunYuanMoEV1ForCausalLM": _HfExamplesInfo("tencent/Hunyuan-A13B-Instruct"),
We need to register the dense model here as well.
Done. Added HunYuanDenseV1ForCausalLM.
Note:
We are currently working on some HF model governance. The architecture of the previously released dense model is called HunYuanForCausalLM; subsequent dense models will use HunYuanDenseV1ForCausalLM. If you want to run the previous model, you need to change the architecture in its config (a sketch follows). This PR does not include adaptation of the previous model.
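For anyone who wants to run such an older checkpoint, a hypothetical one-off snippet like the following (the path is a placeholder; this is not part of the PR) is the kind of edit being described: it simply rewrites the architectures field of the checkpoint's config.json to the new name.

import json
from pathlib import Path

config_path = Path("/path/to/old-hunyuan-dense-checkpoint/config.json")  # placeholder path
config = json.loads(config_path.read_text())
# Previously "HunYuanForCausalLM"; point it at the architecture this PR registers.
config["architectures"] = ["HunYuanDenseV1ForCausalLM"]
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))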
is_neox_style = True
if quant_config is not None and quant_config.get_name() == "gguf":
    is_neox_style = False
I think this is only for Llama GGUF models.
Done. Removed the related code.
is_neox_style = True
if quant_config is not None and quant_config.get_name() == "gguf":
    is_neox_style = False
Ditto.
Done. Removed the related code.
split_params_mapping = [
    (".gate_up_proj", ".gate_and_up_proj", 2, [(1, 1), (0, 1)], None),
    (
        ".qkv_proj",
        ".qkv_proj",
        num_attention_heads + num_kv_heads * 2,
        [("q", num_attention_heads), ("k", num_kv_heads),
         ("v", num_kv_heads)],
        self._split_qkv_weight,
    ),
]
Why do we have to split weights that have already been stacked? I think we should be able to load them directly.
This is due to some historical reasons: the weights that were originally combined need to be split and then re-concatenated because their layout does not match the expected one, so the code here needs to be retained (a simplified illustration follows).
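As a simplified illustration of the point above — plain PyTorch with an assumed checkpoint layout, not the PR's actual _split_qkv_weight — the loader slices a fused qkv weight by head counts and re-concatenates the pieces in the layout the stacked vLLM parameter expects:

import torch

def split_interleaved_qkv(fused: torch.Tensor, num_attention_heads: int,
                          num_kv_heads: int, head_dim: int) -> torch.Tensor:
    # Assume (purely for illustration) the checkpoint interleaves per KV group:
    # [q_0..q_{r-1}, k, v] repeated num_kv_heads times, where r = q_heads // kv_heads.
    r = num_attention_heads // num_kv_heads
    group_rows = (r + 2) * head_dim
    chunks = torch.split(fused, [group_rows] * num_kv_heads, dim=0)
    qs, ks, vs = [], [], []
    for c in chunks:
        qs.append(c[: r * head_dim])
        ks.append(c[r * head_dim: (r + 1) * head_dim])
        vs.append(c[(r + 1) * head_dim:])
    # Re-stack as [all q | all k | all v], the order the stacked qkv_proj expects.
    return torch.cat(qs + ks + vs, dim=0)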
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Thank you for your contribution. Since I have confirmed with the PR author that the generated results are aligned, let's merge this PR first; related improvements can be completed in subsequent PRs.
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Hi! Will HunYuanConfig be included too?
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Rafael Marcelino Koike <rafael.koike@oracle.com>
Looks like
Should be fixed by #20343, sorry for merging this.
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Description
With this PR, the Hunyuan inference team adds support for the Hunyuan-A13B model. The new hunyuan_v1.py and related files support the two architectures HunYuanDenseV1ForCausalLM and HunYuanMoEV1ForCausalLM. We have validated the accuracy of this PR; HunYuan (a new MoE LLM from Tencent) will be open-sourced in the coming days. A minimal usage sketch follows.
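A minimal offline-inference sketch, assuming a vLLM build that includes this PR and access to the Hugging Face checkpoint; the prompt, sampling settings, and tensor_parallel_size are only examples (the A13B MoE model typically needs multiple GPUs).

from vllm import LLM, SamplingParams

llm = LLM(model="tencent/Hunyuan-A13B-Instruct",
          trust_remote_code=True,
          tensor_parallel_size=4)  # adjust to your hardware
outputs = llm.generate(["Introduce the HunYuan model family in one sentence."],
                       SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)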
Thanks~