[Model]Add Tencent HunYuanMoEV1 Model Support #20114
Conversation
Summary of Changes
Hello @aiyiwang2025, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive support for Tencent's HunYuan V1 models, encompassing both dense and Mixture-of-Experts (MoE) variants, into the vLLM framework. This integration enables efficient inference for these new models by incorporating their specific architectural components, including specialized rotary embeddings and advanced attention mechanisms, and by registering them within the vLLM ecosystem.
Highlights
- New Model Definition: A new module, `hunyuan_v1.py`, has been added, defining the core components for Tencent's HunYuan models. This includes their specific MLP, attention mechanisms (supporting self- and cross-attention), decoder layers, and the overall model structure, enabling vLLM to support these new architectures.
- Mixture-of-Experts (MoE) Support: The pull request introduces the `HunYuanSparseMoeBlock` to specifically handle the Mixture-of-Experts (MoE) architecture found in HunYuanMoEV1. This includes logic for expert routing and shared MLP components, crucial for efficient inference of MoE models.
- Dynamic NTK Alpha Rotary Embedding: A new rotary embedding class, `DynamicNTKAlphaRotaryEmbedding`, has been implemented and integrated into the `get_rope` function. This addition supports HunYuan's unique dynamic NTK alpha scaling for positional embeddings, ensuring accurate model behavior (a minimal sketch follows this list).
- Cross-Layer Attention (CLA) Handling: The HunYuan model architecture now includes specific logic to manage and utilize `prev_kv_states` for Cross-Layer Attention (CLA), a feature potentially used in HunYuan models to enhance performance or context understanding.
- Model Registration: Both `HunYuanDenseV1ForCausalLM` and `HunYuanMoEV1ForCausalLM` models are now registered within the vLLM model registry, making them discoverable and loadable by the framework for inference.
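As a rough orientation for reviewers, here is a minimal sketch of the dynamic NTK alpha base scaling used when building the cos/sin cache. It is written in plain PyTorch as standalone functions rather than the PR's DynamicNTKAlphaRotaryEmbedding class; the scaling formula mirrors the one in the PR, while the function names and standalone structure are illustrative assumptions.

import torch

def ntk_alpha_inv_freq(rotary_dim: int, base: float, scaling_alpha: float) -> torch.Tensor:
    # Scale the RoPE base by alpha ** (dim / (dim - 2)), as in the PR's
    # _compute_cos_sin_cache, then derive the usual inverse frequencies.
    scaled_base = base * scaling_alpha ** (rotary_dim / (rotary_dim - 2))
    exponents = torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim
    return 1.0 / (scaled_base ** exponents)

def ntk_alpha_cos_sin_cache(max_position_embeddings: int, rotary_dim: int,
                            base: float, scaling_alpha: float) -> torch.Tensor:
    inv_freq = ntk_alpha_inv_freq(rotary_dim, base, scaling_alpha)
    t = torch.arange(max_position_embeddings, dtype=torch.float32)
    freqs = torch.einsum("i,j->ij", t, inv_freq)  # [max_len, rotary_dim // 2]
    return torch.cat((freqs.cos(), freqs.sin()), dim=-1)

# Example: with alpha > 1 the effective base grows, lowering frequencies for long context.
cache = ntk_alpha_cos_sin_cache(max_position_embeddings=4096, rotary_dim=128,
                                base=10000.0, scaling_alpha=1000.0)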
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The pull request introduces support for Tencent's HunYuanMoEV1 model within the vLLM framework. It includes the addition of a new rotary embedding and a model definition file, along with modifications to the model registry. The changes appear to be well-structured and validated, but there are a few opportunities to improve code clarity and remove redundant code.
def _compute_cos_sin_cache(self) -> torch.Tensor:
    # For Hunyuan DynamicNTKAlphaRotaryEmbedding
    max_len = self.max_position_embeddings
    base = self.base * self.scaling_alpha ** (self.rotary_dim / (self.rotary_dim-2))
Consider adding parentheses to improve readability and explicitly define the order of operations in this calculation. It's good practice to make the intent clear, especially when dealing with multiple operators.
Style Guide References
base = self.base * self.scaling_alpha ** (self.rotary_dim / (self.rotary_dim-2))
base = self.base * (self.scaling_alpha ** (self.rotary_dim / (self.rotary_dim-2)))
Footnotes
1. While there is no explicit style guide provided, using parentheses for clarity in complex calculations is a common practice to improve code readability.
k_tmp = torch.empty_like(k)  # TODO: redundant rotary embedding
q, _ = self.rotary_emb(positions, q, k_tmp)
This rotary embedding is marked as redundant. It should be removed to avoid unnecessary computation and memory allocation. If it's truly needed in some cases, clarify the conditions under which it's used; otherwise, remove it entirely. Leaving dead code can lead to confusion and maintenance issues.
k_tmp = torch.empty_like(k)  # TODO: redundant rotary embedding
q, _ = self.rotary_emb(positions, q, k_tmp)
Style Guide References
Footnotes
1. While there is no explicit style guide provided, avoiding dead code is a common practice to improve code maintainability.
Thanks for implementing your model in vLLM! Can you add it to the list of supported models and update the tests as well?
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run the full CI to test the changes comprehensively before merging.
Just added some initial comments after a quick glance; I'll take a deeper look later. PTAL!
bitsandbytes_stacked_params_mapping = {
    # shard_name, weight_name, index
    "q_proj": ("qkv_proj", 0),
    "k_proj": ("qkv_proj", 1),
    "v_proj": ("qkv_proj", 2),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}
bitsandbytes_stacked_params_mapping = {
    # shard_name, weight_name, index
    "q_proj": ("qkv_proj", 0),
    "k_proj": ("qkv_proj", 1),
    "v_proj": ("qkv_proj", 2),
    "gate_proj": ("gate_up_proj", 0),
    "up_proj": ("gate_up_proj", 1),
}
This field isn't used anymore.
Done, removed the unused parts.
class HunYuanAttention(nn.Module):
I suggest decoupling the self-attn and cross-attn implementations into HunYuanAttention and HunYuanCrossAttention respectively.
Currently, HunYuanMoEV1ForCausalLM does not use cross-attn but self-attn, so we keep HunYuanAttention.
Although cross-attn is unused, I think decoupling it would improve readability; otherwise this attention layer implementation is a bit too long to read.
Done. Decoupled the self-attn and cross-attn implementations into HunYuanAttention and HunYuanCrossAttention respectively.
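For readers following this thread, here is a simplified, hypothetical sketch (plain PyTorch, not the PR's actual HunYuanAttention/HunYuanCrossAttention classes) of what the decoupling looks like conceptually: the self-attention module projects Q, K, and V and returns its K/V for reuse, while the cross-layer-attention module projects only Q and consumes prev_kv_states from an earlier layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionSketch(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor):
        b, s, _ = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        q = q.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Hand K/V back so a later CLA layer can reuse them as prev_kv_states.
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1)), (k, v)

class CrossLayerAttentionSketch(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor, prev_kv_states):
        b, s, _ = x.shape
        k, v = prev_kv_states  # K/V produced by an earlier layer
        q = self.q_proj(x).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))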
docs/models/supported_models.md
Outdated
@@ -388,6 +388,7 @@ Specified using `--task generate`.
| `MiniMaxM1ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-M1-40k`, `MiniMaxAI/MiniMax-M1-80k`, etc. | | | | |
| `MiniMaxText01ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-Text-01`, etc. | | | | |
| `Zamba2ForCausalLM` | Zamba2 | `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc. | | | | |
| `HunYuanMoEV1ForCausalLM` | Hunyuan-80B-A13B | `tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc. | | | ✅︎ |
I think we haven't supported cross attention in v1 yet, does this model work with v1?
Because it uses self-attn, it currently supports v1 and has been verified.
Can you also update the dense model in the document? And it seems PP should be supported too.
@@ -0,0 +1,851 @@
# coding=utf-8
# coding=utf-8
Suggested:
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# coding=utf-8
Done.
"embed_tokens": "input_embeddings", | ||
"lm_head": "output_embeddings", | ||
} | ||
embedding_padding_modules = ["lm_head"] |
There is no need to explicitly set embedding_modules and embedding_padding_modules.
Done, removed the unused parts.
Force-pushed from 864f1bc to 06c60c3.
"HunYuanDenseV1ForCausalLM": ("hunyuan_v1", "HunYuanV1ForCausalLM"), | ||
"HunYuanMoEV1ForCausalLM": ("hunyuan_v1", "HunYuanV1ForCausalLM"), |
I think we had better decouple the dense model and MoE model implementations, because the MoE model has more complicated weight-loading logic, and the coupling makes maintenance difficult.
Done. Decoupled the dense model and MoE model implementations.
class HunYuanAttention(nn.Module):
Although cross-attn is unused, I think decoupling it would improve readability; otherwise this attention layer implementation is a bit too long to read.
Co-authored-by: quinnrong <quinnrong@tencent.com> Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Force-pushed from 8bf8ea6 to a8ce7ea.
An exception is thrown when loading the Hunyuan-A13B-Instruct-FP8 model. @aiyiwang2025
I also had the error that @xjpang got. Patched it by adding a prefix to the experts and gate modules (see git diff).
Got it. Thanks.
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
@xjpang In fact, I did not encounter the above problems. You can try the corresponding solution from @intervitens. I have also synchronized the corresponding code into this PR.
@aiyiwang2025 Can you provide the reasoning and function calling parsers?
@xjpang The reasoning parser is currently under development; for the tool call parser you can refer to this.
return final_hidden_states.view(orig_shape)

class HunYuanAttention(nn.Module):
I think we can reuse the attention and MLP implementations from hunyuan_v1_dense.py.
Done. Removed the duplicate logic in hunyuan_v1_moe.py.
self.attn = Attention(
    self.num_heads,
    self.head_dim,
    self.scaling,
    num_kv_heads=self.num_kv_heads,
    cache_config=cache_config,
    quant_config=quant_config,
    prefix=f"{prefix}.attn",
)
self.attn = Attention(
    self.num_heads,
    self.head_dim,
    self.scaling,
    num_kv_heads=self.num_kv_heads,
    cache_config=cache_config,
    quant_config=quant_config,
    prefix=f"{prefix}.attn",
)
Suggested:
self.attn = Attention(
    self.num_heads,
    self.head_dim,
    self.scaling,
    num_kv_heads=self.num_kv_heads,
    cache_config=cache_config,
    quant_config=quant_config,
    prefix=f"{prefix}.attn",
    attn_type=AttentionType.ENCODER_DECODER,
)
We need to pass AttentionType.ENCODER_DECODER for cross-attn.
Done.
attention_type = ("cross" if layer_id >= 0
                  and layer_id % cla_factor != 0 else "self")
attention_type = ("cross" if layer_id >= 0
                  and layer_id % cla_factor != 0 else "self")
Suggested:
attention_type = (AttentionType.ENCODER_DECODER if layer_id >= 0
                  and layer_id % cla_factor != 0 else AttentionType.DECODER)
We can use vLLM's AttentionType enum here (vllm/attention/backends/abstract.py, lines 20 to 32 at 7b1895e):
class AttentionType:
    """
    Attention type.
    Use string to be compatible with `torch.compile`.
    """
    # Decoder attention between previous layer Q/K/V
    DECODER = "decoder"
    # Encoder attention between previous layer Q/K/V for encoder-decoder
    ENCODER = "encoder"
    # Encoder attention between previous layer Q/K/V
    ENCODER_ONLY = "encoder_only"
    # Attention between dec. Q and enc. K/V for encoder-decoder
    ENCODER_DECODER = "encoder_decoder"
Done.
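To make the cla_factor logic above concrete, here is a small illustrative helper — a hypothetical standalone function, not code from the PR, assuming only the AttentionType enum shown above from the vLLM tree. With cla_factor=2, even layers do standard decoder self-attention and odd layers reuse the previous layer's K/V, which vLLM expresses as ENCODER_DECODER.

from vllm.attention.backends.abstract import AttentionType

def pick_attention_type(layer_id: int, cla_factor: int) -> str:
    # Layers whose index is not a multiple of cla_factor reuse K/V from the
    # previous layer (cross-layer attention); the rest compute fresh K/V.
    if layer_id >= 0 and layer_id % cla_factor != 0:
        return AttentionType.ENCODER_DECODER
    return AttentionType.DECODER

print([pick_attention_type(i, cla_factor=2) for i in range(6)])
# ['decoder', 'encoder_decoder', 'decoder', 'encoder_decoder', 'decoder', 'encoder_decoder']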
if "mlp.experts" in name: | ||
continue |
if "mlp.experts" in name: | |
continue |
Seems unnecessary for dense model.
Done.
docs/models/supported_models.md
Outdated
@@ -388,6 +388,7 @@ Specified using `--task generate`.
| `MiniMaxM1ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-M1-40k`, `MiniMaxAI/MiniMax-M1-80k`, etc. | | | | |
| `MiniMaxText01ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-Text-01`, etc. | | | | |
| `Zamba2ForCausalLM` | Zamba2 | `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc. | | | | |
| `HunYuanMoEV1ForCausalLM` | Hunyuan-80B-A13B | `tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc. | | | ✅︎ |
Can you also update the dense model in the document? And it seems PP should be supported too.
tests/models/registry.py
Outdated
@@ -259,6 +259,7 @@ def check_available_online(
    "Zamba2ForCausalLM": _HfExamplesInfo("Zyphra/Zamba2-7B-instruct"),
    "MiMoForCausalLM": _HfExamplesInfo("XiaomiMiMo/MiMo-7B-RL",
                                       trust_remote_code=True),
    "HunYuanMoEV1ForCausalLM": _HfExamplesInfo("tencent/Hunyuan-A13B-Instruct"),
We need to register the dense model here as well.
Done. Added HunYuanDenseV1ForCausalLM.
Note:
We are currently working on some HF model governance. The architecture of the previously released dense model is called HunYuanForCausalLM; subsequent dense models will use HunYuanDenseV1ForCausalLM. If you want to run the previous model, you need to change the architecture in its config (a sketch follows). This PR does not include adaptation of the previous model.
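For anyone who wants to run such an older checkpoint, a hypothetical one-off snippet like the following (the path is a placeholder; this is not part of the PR) is the kind of edit being described: it simply rewrites the architectures field of the checkpoint's config.json to the new name.

import json
from pathlib import Path

config_path = Path("/path/to/old-hunyuan-dense-checkpoint/config.json")  # placeholder path
config = json.loads(config_path.read_text())
# Previously "HunYuanForCausalLM"; point it at the architecture this PR registers.
config["architectures"] = ["HunYuanDenseV1ForCausalLM"]
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))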
is_neox_style = True
if quant_config is not None and quant_config.get_name() == "gguf":
    is_neox_style = False
I think this is only for Llama GGUF models.
Done. Removed the related code.
is_neox_style = True
if quant_config is not None and quant_config.get_name() == "gguf":
    is_neox_style = False
Ditto.
Done. Removed the related code.
split_params_mapping = [
    (".gate_up_proj", ".gate_and_up_proj", 2, [(1, 1), (0, 1)], None),
    (
        ".qkv_proj",
        ".qkv_proj",
        num_attention_heads + num_kv_heads * 2,
        [("q", num_attention_heads), ("k", num_kv_heads),
         ("v", num_kv_heads)],
        self._split_qkv_weight,
    ),
]
Why do we have to split weights that have already been stacked? I think we should be able to load them directly.
This is due to some historical reasons: the weights that were originally combined need to be split and then re-concatenated because their layout does not match the expected one, so the code here needs to be retained (a simplified illustration follows).
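As a simplified illustration of the point above — plain PyTorch with an assumed checkpoint layout, not the PR's actual _split_qkv_weight — the loader slices a fused qkv weight by head counts and re-concatenates the pieces in the layout the stacked vLLM parameter expects:

import torch

def split_interleaved_qkv(fused: torch.Tensor, num_attention_heads: int,
                          num_kv_heads: int, head_dim: int) -> torch.Tensor:
    # Assume (purely for illustration) the checkpoint interleaves per KV group:
    # [q_0..q_{r-1}, k, v] repeated num_kv_heads times, where r = q_heads // kv_heads.
    r = num_attention_heads // num_kv_heads
    group_rows = (r + 2) * head_dim
    chunks = torch.split(fused, [group_rows] * num_kv_heads, dim=0)
    qs, ks, vs = [], [], []
    for c in chunks:
        qs.append(c[: r * head_dim])
        ks.append(c[r * head_dim: (r + 1) * head_dim])
        vs.append(c[(r + 1) * head_dim:])
    # Re-stack as [all q | all k | all v], the order the stacked qkv_proj expects.
    return torch.cat(qs + ks + vs, dim=0)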
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Thank you for your contribution. Since I have confirmed with the PR author that the generated results are aligned, let's merge this PR first; related improvements can be completed in subsequent PRs.
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Hi! Will HunYuanConfig be included too?
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Rafael Marcelino Koike <rafael.koike@oracle.com>
Looks like
Should be fixed by #20343, sorry for merging this.
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: aiyiwang <aiyiwang@tencent.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: quinnrong <quinnrong@tencent.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Description
With this PR, the Hunyuan inference team adds support for the Hunyuan-A13B model. The new hunyuan_v1.py and related files support the two architectures HunYuanDenseV1ForCausalLM and HunYuanMoEV1ForCausalLM. We have validated the accuracy of this PR; HunYuan (a new MoE LLM from Tencent) will be open-sourced in the coming days. A minimal usage sketch follows.
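A minimal offline-inference sketch, assuming a vLLM build that includes this PR and access to the Hugging Face checkpoint; the prompt, sampling settings, and tensor_parallel_size are only examples (the A13B MoE model typically needs multiple GPUs).

from vllm import LLM, SamplingParams

llm = LLM(model="tencent/Hunyuan-A13B-Instruct",
          trust_remote_code=True,
          tensor_parallel_size=4)  # adjust to your hardware
outputs = llm.generate(["Introduce the HunYuan model family in one sentence."],
                       SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)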
Thanks~