
Conversation

@zixi-qi (Collaborator) commented Jun 12, 2025

Purpose

When the EAGLE head needs to share the vocab embedding with the target model and the target model has multimodal support, the EAGLE head should load the vocab embedding from the language-model component of the target model rather than from the top-level multimodal wrapper.
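As an illustration of the intended behavior, here is a minimal sketch with stand-in classes (not the actual vLLM code; `LanguageModel`, `MultimodalModel`, and `get_language_model` are hypothetical names used only for this example):

```python
# Illustrative sketch: how a draft head can resolve the component that owns
# the vocab embedding. A multimodal target wraps its text backbone, while a
# text-only target *is* the text backbone.

class LanguageModel:
    def __init__(self):
        self.embed_tokens = "vocab-embedding"  # stands in for an nn.Embedding


class MultimodalModel:
    """A multimodal target holds a language model next to vision towers."""
    def __init__(self):
        self.language_model = LanguageModel()


def get_language_model(target_model):
    # Multimodal targets expose their text backbone under `language_model`;
    # fall back to the model itself for text-only targets.
    return getattr(target_model, "language_model", target_model)


# The draft head should share embeddings with the language-model component
# in both cases:
assert get_language_model(MultimodalModel()).embed_tokens == "vocab-embedding"
assert get_language_model(LanguageModel()).embed_tokens == "vocab-embedding"
```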

Test Plan & Result

VLLM_USE_V1=1 python examples/offline_inference/eagle.py  --num_spec_tokens 7 --max_num_seqs 1 --num_prompts 5 --method eagle --enforce_eager

--------------------------------------------------
mean acceptance length: 2.62
--------------------------------------------------
acceptance at token 0:0.73
acceptance at token 1:0.44
acceptance at token 2:0.25
acceptance at token 3:0.13
acceptance at token 4:0.07
acceptance at token 5:0.01
acceptance at token 6:0.00
VLLM_USE_V1=1 python examples/offline_inference/eagle.py  --num_spec_tokens 7 --max_num_seqs

--------------------------------------------------
mean acceptance length: 4.15
--------------------------------------------------
acceptance at token 0:0.78
acceptance at token 1:0.63
acceptance at token 2:0.54
acceptance at token 3:0.37
acceptance at token 4:0.36
acceptance at token 5:0.27
acceptance at token 6:0.21

pytest -v tests/v1/spec_decode/test_eagle.py
pytest -v tests/v1/e2e/test_spec_decode.py

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (Contributor, bot) left a comment

Summary of Changes

Hello @zixi-qi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fixes a bug in the EAGLE speculative decoding implementation where, in the case of a multimodal target model, the EAGLE head was not correctly identifying the language model component for sharing vocabulary embeddings and the language model head. The changes ensure that the EAGLE head interacts with the language-specific part of the multimodal model for weight sharing, resolving the issue and improving compatibility with multimodal architectures.

Highlights

  • Multimodal Target Model Compatibility: Ensured that when the EAGLE head shares weights (like embeddings and the LM head) with a multimodal target model, it correctly identifies and uses the language model component of the target model.
  • Embedding Sharing Logic: Updated the logic for sharing the embed_tokens layer to compare shapes and assign the embedding layer from the language model part of the target model when dealing with multimodal models.
  • LM Head Sharing Logic: Updated the logic for sharing the lm_head layer to consistently use the language model part of the target model, simplifying the conditional logic.
  • Image Token Index Handling: Added logic to copy the image_token_index from the target model's config to the EAGLE model's config if the target model is multimodal and has this attribute.

@mergify mergify bot added the v1 label Jun 12, 2025
@gemini-code-assist (Contributor, bot) left a comment

Code Review

The pull request fixes an issue where the EAGLE head was not correctly loading vocabulary embeddings for multimodal target models. The changes ensure that the EAGLE head interacts with the language model component of a multimodal target model for shared embeddings and LM head. The solution seems sound and improves compatibility with multimodal models.

Comment on lines 333 to 336:

    # handle multimodality
    if hasattr(target_model.config, "image_token_index"):
        self.model.config.image_token_index = (
            target_model.config.image_token_index)
Severity: medium

Consider extracting this logic into a separate helper function for better readability and maintainability.

def _copy_image_token_index(target_model: nn.Module, model_config: PretrainedConfig) -> None:
    if hasattr(target_model.config, "image_token_index"):
        model_config.image_token_index = target_model.config.image_token_index

if supports_multimodal(target_model):
    _copy_image_token_index(target_model, self.model.config)
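The suggested helper can be exercised in isolation. A self-contained version, using `SimpleNamespace` objects as stand-ins for the real model and config classes (illustrative only, not the vLLM types), might look like:

```python
from types import SimpleNamespace


def copy_image_token_index(target_model, draft_config) -> None:
    # Propagate the image token id so the draft model can recognize
    # image placeholder tokens in multimodal prompts.
    if hasattr(target_model.config, "image_token_index"):
        draft_config.image_token_index = target_model.config.image_token_index


# Stand-in objects; 32000 is an arbitrary example token id.
target = SimpleNamespace(config=SimpleNamespace(image_token_index=32000))
draft_config = SimpleNamespace()

copy_image_token_index(target, draft_config)
assert draft_config.image_token_index == 32000

# A target without the attribute leaves the draft config untouched.
text_only_target = SimpleNamespace(config=SimpleNamespace())
other_config = SimpleNamespace()
copy_image_token_index(text_only_target, other_config)
assert not hasattr(other_config, "image_token_index")
```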

# share embed_tokens with the target model if needed
if get_pp_group().world_size == 1 \
        and self.model.model.embed_tokens.weight.shape \
-        == target_model.model.embed_tokens.weight.shape:
+        == target_language_model.model.embed_tokens.weight.shape:
Severity: medium

Consider adding a check to ensure that target_language_model.model.embed_tokens is not None before accessing its weight attribute to avoid potential AttributeError exceptions.

if target_language_model.model.embed_tokens and self.model.model.embed_tokens.weight.shape == target_language_model.model.embed_tokens.weight.shape:

@@ -349,12 +358,9 @@ def load_model(self, target_model: nn.Module) -> None:
# some model definition do not define lm_head explicitly
# and reuse embed_tokens for lm_head, e.g., CohereForCausalLM
if self.vllm_config.speculative_config.method != "eagle3" and \
-        hasattr(target_model, "lm_head"):
+        hasattr(target_language_model, "lm_head"):
Severity: medium

Consider adding a check to ensure that target_language_model has the attribute lm_head and that it is not None before accessing it to avoid potential AttributeError exceptions.

if hasattr(target_language_model, "lm_head") and target_language_model.lm_head is not None:
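The guard suggested here can be sketched with stand-in classes (illustrative names only, not vLLM's real classes):

```python
# Sketch of defensive lm_head sharing: guard against both a missing lm_head
# attribute and one that is explicitly None (some architectures reuse
# embed_tokens instead of defining lm_head).

class TargetWithoutHead:
    lm_head = None  # defined but unset


class TargetWithHead:
    lm_head = "shared-lm-head"


class Draft:
    lm_head = "own-lm-head"


def maybe_share_lm_head(draft, target) -> bool:
    head = getattr(target, "lm_head", None)
    if head is not None:  # covers both missing and None lm_head
        draft.lm_head = head
        return True
    return False  # draft keeps (or builds) its own head


draft = Draft()
assert not maybe_share_lm_head(draft, TargetWithoutHead())
assert draft.lm_head == "own-lm-head"
assert maybe_share_lm_head(draft, TargetWithHead())
assert draft.lm_head == "shared-lm-head"
```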

@zixi-qi zixi-qi requested a review from morgendave June 12, 2025 17:20
@WoosukKwon (Collaborator) commented:

cc @benchislett

@@ -329,16 +329,25 @@ def load_model(self, target_model: nn.Module) -> None:

    self.attn_layer_names = list(draft_attn_layer_names)

    if supports_multimodal(target_model):
        # handle multimodality
        if hasattr(target_model.config, "image_token_index"):
A Collaborator commented:

why is it not sufficient to check "supports_multimodal"? If this is just double-checking that the target model indeed supports multimodal, shouldn't it be incorporated into the supports_multimodal function?

@zixi-qi (Collaborator, Author) replied:

Makes sense, supports_multimodal should be sufficient. Updated the PR, and thanks for the quick review!
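For context, vLLM's supports_multimodal checks whether a model implements the multimodal interface. A much-simplified stand-in for that pattern (illustrative only; the real check lives in vLLM's model interface code) is:

```python
# Simplified sketch of an interface-based multimodal check: models opt in by
# inheriting from a marker class, and the check is a single isinstance test.

class SupportsMultiModal:
    """Marker base class for models that accept image inputs."""


class LlavaLikeModel(SupportsMultiModal):
    pass


class TextOnlyModel:
    pass


def supports_multimodal(model) -> bool:
    return isinstance(model, SupportsMultiModal)


assert supports_multimodal(LlavaLikeModel())
assert not supports_multimodal(TextOnlyModel())
```

Because the interface check already encodes "this target is multimodal," a second hasattr probe on the config becomes redundant, which is the point of the review comment above.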

@benchislett (Collaborator) left a comment:

LGTM

@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 12, 2025
@WoosukKwon (Collaborator) commented:

@benchislett Thanks for reviewing this PR so quickly!

Signed-off-by: qizixi <qizixi@meta.com>
@houseroad (Collaborator) left a comment:

Looks good. It would be better to add some unit tests (it's okay to add them in a follow-up PR, though).

@houseroad (Collaborator) commented:

Please fix the linter as well?

@houseroad (Collaborator) commented:

Actually, it looks like an infra error; re-running it now.

@aarnphm aarnphm merged commit c68698b into vllm-project:main Jun 13, 2025
64 of 65 checks passed

@zixi-qi zixi-qi deleted the eagle-mm-refactor branch June 13, 2025 05:55
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
…project#19570)

Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: minpeter <kali2005611@gmail.com>
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
…project#19570)

Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Labels
ready ONLY add when PR is ready to merge/full CI is needed v1