vlm: adapt internvl to VisionAttention #6870
Conversation
Hello @mickqian, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR focuses on integrating a new `VisionAttention` layer into the InternVL model within the SGLang framework. The primary goal appears to be replacing the custom `FlashAttention` implementation previously used in the InternVL vision encoder with a more standardized and potentially more flexible `VisionAttention` layer. This involves significant changes to the InternVL model's attention mechanism, its forward pass, and how weights are loaded. Additionally, it introduces a new way to handle cumulative sequence lengths (`cu_seqlens`) using a `SingletonCache` and a cached utility function, and refactors a simple "print once" utility.
Highlights
- Vision Attention Integration: The core change replaces the existing attention mechanism in the InternVL vision encoder (`InternAttention`) with a new `VisionAttention` layer, likely to leverage shared attention logic or optimizations.
- Cumulative Sequence Length Handling: Introduces a `SingletonCache` and a cached function (`_get_cu_seqlens_for_shape`) in `vision.py` to manage `cu_seqlens`, and updates the attention forward passes in both `vision.py` and `internvl.py` to use this new approach.
- Weight Loading Adaptation: Modifies the `load_weights` method in `internvl.py` to correctly map parameter names from the original InternVL checkpoint to the new structure introduced by the `VisionAttention` layer.
- Code Refactoring: Removes a custom `FlashAttention` class and related logic from `internvl.py`, and replaces a custom `execute_once`/`info_once` pattern with a standard `functools.lru_cache`-based `print_info_once` in `utils.py`.
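For context, `cu_seqlens` ("cumulative sequence lengths") is the prefix-sum offset array that varlen attention kernels use to locate each packed sequence inside a flattened batch. A minimal pure-Python sketch of how such offsets are derived for a batch of equal-length sequences (the function name here is illustrative, not an actual SGLang helper):

```python
from itertools import accumulate


def make_cu_seqlens(bsz: int, seq_len: int) -> list[int]:
    """Cumulative sequence lengths for `bsz` sequences of length `seq_len`.

    Varlen attention kernels use these offsets to find where each
    sequence starts and ends in a flattened (bsz * seq_len) batch.
    """
    return [0] + list(accumulate([seq_len] * bsz))


# Three sequences of 4 tokens each, packed into 12 tokens total:
print(make_cu_seqlens(3, 4))  # [0, 4, 8, 12]
```

Because these offsets depend only on the batch shape, they are a natural candidate for caching, which is what the `SingletonCache` / cached-function approach in this PR exploits.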
Changelog
- `python/sglang/srt/layers/attention/vision.py`
  - Added imports for `dataclasses`, `functools`, `Any`, `Union`, and `print_info_once` (L3-7, L14).
  - Removed `wraps` import (L6).
  - Removed `logger` import (L34).
  - Removed custom `execute_once` and `info_once` functions (L39-54).
  - Added `SingletonCache` dataclass for caching data (L41-52).
  - Added cached function `_get_cu_seqlens_for_shape` to generate cumulative sequence lengths (L56-69).
  - Modified the `VisionSdpaAttention.forward` signature to accept `bsz`, `seq_len`, and `cu_seqlens`, which can be a `SingletonCache` or a `torch.Tensor` (L283-285).
  - Added logic in `VisionSdpaAttention.forward` to calculate `cu_seqlens` if `None`, or retrieve/set it from the `SingletonCache` (L294-301).
  - Updated the QKV splitting logic in `VisionAttention.forward` to use `self.q_size` and `self.kv_size` (L459).
  - Adjusted the view shape calculation in `VisionAttention.forward` based on `q_size` and `kv_size` (L454).
  - Added a `seq_len=s` argument to the `self.qkv_backend` call in `VisionAttention.forward` (L497).
  - Replaced calls to `info_once` with `print_info_once` in `VisionAttention.__init__` (L374, L378).
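The caching pattern described above — a small mutable holder plus an `lru_cache`-memoized generator — can be sketched roughly as follows. Note this is a pure-Python illustration of the pattern's shape only; the real `SingletonCache` and `_get_cu_seqlens_for_shape` in `vision.py` operate on `torch.Tensor`s and take a device argument, and the method names here are assumptions:

```python
import functools
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SingletonCache:
    """Holds a single cached value so one computation can be shared
    across all encoder layers within a forward pass."""

    data: Optional[Any] = None

    def set_data(self, value: Any) -> None:
        self.data = value

    def get_data(self) -> Optional[Any]:
        return self.data


@functools.lru_cache(maxsize=None)
def get_cu_seqlens_for_shape(bsz: int, seq_len: int) -> tuple[int, ...]:
    # Memoized on (bsz, seq_len): repeated calls with the same shape
    # reuse the previously computed offsets instead of recomputing them.
    return tuple(i * seq_len for i in range(bsz + 1))


# First layer computes and stores; subsequent layers reuse the cached value.
cache = SingletonCache()
if cache.get_data() is None:
    cache.set_data(get_cu_seqlens_for_shape(2, 3))
print(cache.get_data())  # (0, 3, 6)
```

The two caches serve different lifetimes: the `lru_cache` persists across forward passes keyed by shape, while the `SingletonCache` instance scopes sharing to a single pass through the encoder layers.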
- `python/sglang/srt/models/internvl.py`
  - Added imports for `Set`, `SingletonCache`, and `VisionAttention` (L14, L27).
  - Removed imports for `rearrange` and `repeat` (L22).
  - Removed the `FlashAttention` class (L43-107).
  - Modified `InternAttention.__init__` to accept `quant_config` (L46).
  - Replaced the `self.qkv` linear layer and `self.inner_attn` with a single `self.attn` instance of `VisionAttention` (L56-66).
  - Removed the `_flash_attn` method (L132-152).
  - Modified `InternAttention.forward` to accept `cu_seqlens` and pass it to `self.attn` (L79-81).
  - Modified `InternVisionEncoderLayer.forward` to accept `cu_seqlens` (L216).
  - Updated the call to `self.attn` within `InternVisionEncoderLayer.forward` to pass `cu_seqlens` (L229).
  - Added initialization of `cu_seqlens = SingletonCache()` in `InternVisionModel.forward` (L298).
  - Updated the call to `encoder_layer` within `InternVisionModel.forward` to pass `cu_seqlens` (L303).
  - Added `loaded_params: Set[str] = set()` in `InternVLChatModel.load_weights` (L560).
  - Added logic in `InternVLChatModel.load_weights` to adapt parameter names for the new attention structure (`attn.` to `attn.attn.`, `qkv.` to `qkv_proj.`) when loading vision model weights (L577-580).
  - Added `loaded_params.add(name)` after loading each parameter in `InternVLChatModel.load_weights` (L606).
  - Added a check at the end of `InternVLChatModel.load_weights` to ensure all parameters were loaded (L607-611).
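The weight-loading adaptation amounts to rewriting checkpoint keys so they land on the new module hierarchy, then tracking which parameters were consumed. A hedged sketch of that idea (the helper name, the exact rewrite ordering, and the scoping to vision-model keys are assumptions; the real logic lives inline in `InternVLChatModel.load_weights`):

```python
def remap_vision_weight_name(name: str) -> str:
    """Map an original InternVL checkpoint key onto the new
    VisionAttention module structure (illustrative sketch only)."""
    if "attn.qkv." in name:
        # The fused QKV projection now lives in VisionAttention's qkv_proj,
        # nested one module deeper (attn.attn.).
        return name.replace("attn.qkv.", "attn.attn.qkv_proj.")
    if "attn." in name:
        # Remaining attention parameters are nested one level deeper.
        return name.replace("attn.", "attn.attn.")
    return name


# Track every parameter actually loaded, so nothing is silently skipped.
loaded_params: set[str] = set()
for ckpt_name in [
    "vision_model.layers.0.attn.qkv.weight",
    "vision_model.layers.0.attn.proj.weight",
    "vision_model.layers.0.mlp.fc1.weight",
]:
    loaded_params.add(remap_vision_weight_name(ckpt_name))

# After loading, verify completeness against the expected parameter set:
expected = {
    "vision_model.layers.0.attn.attn.qkv_proj.weight",
    "vision_model.layers.0.attn.attn.proj.weight",
    "vision_model.layers.0.mlp.fc1.weight",
}
assert loaded_params == expected
```

The final set comparison is what makes the new `loaded_params` check valuable: a typo in the remapping rules fails loudly instead of leaving a randomly initialized layer in the model.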
- `python/sglang/srt/utils.py`
  - Added `functools` import (L20).
  - Added a `print_info_once` utility function using `functools.lru_cache(None)` (L1367-1369).
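The `print_info_once` utility can be approximated with `functools.lru_cache`: because a memoized function body runs only once per distinct argument, wrapping a logging call in it deduplicates repeated messages. A minimal sketch (the real helper in `utils.py` may emit via the module logger rather than `print`):

```python
import functools


@functools.lru_cache(None)
def print_info_once(msg: str) -> None:
    # lru_cache(None) memoizes on msg, so the body (and thus the
    # print) executes at most once per distinct message.
    print(msg)


print_info_once("using fa3 backend for vision attention")
print_info_once("using fa3 backend for vision attention")  # no-op: cached
print_info_once("another message")
```

One caveat of this approach: the cache holds a reference to every distinct message forever, which is fine for a bounded set of info strings but would leak memory if called with unbounded, dynamic messages.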
Vision sees the world,
Attention focuses bright,
New layers connect.
Code Review
This pull request refactors the InternVL model to use a new, more general `VisionAttention` module. This is a good step towards modularity and potentially allows leveraging different attention backends more easily. The changes include updates to import statements, removal of the older `FlashAttention` class, and modifications to how cumulative sequence lengths (`cu_seqlens`) are handled, including a new `SingletonCache` and `_get_cu_seqlens_for_shape` utility. The weight loading mechanism in `InternVLChatModel` has also been improved to adapt to the new attention structure and to ensure all parameters are loaded.
While the refactoring is generally positive, there is a critical concern regarding QK normalization and a few medium-severity points to consider for improvement.
Summary of Findings
- Missing QK Normalization: The refactored `InternAttention` module in `internvl.py` appears to have dropped the QK normalization step that was present in the previous implementation. If this normalization is crucial for the InternVL model, this could lead to correctness or performance issues.
- Device Handling for `cu_seqlens`: In `VisionFlash3Attention.forward`, the `cu_seqlens` tensor is moved to CUDA using `.cuda()`. It would be more robust to use `.to(q.device)` to ensure consistency with the query tensor's device, especially since `_get_cu_seqlens_for_shape` already places it on `q.device`.
- Hardcoded Attention Configuration: The `InternAttention` module hardcodes `qkv_backend="fa3"` and `use_qkv_parallel=True`. Making these configurable could offer greater flexibility for different deployment scenarios or future optimizations.
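For reference on the first finding: QK normalization typically normalizes the query and key vectors per head before the attention score computation, which bounds the logit magnitudes and stabilizes training and inference. A pure-Python sketch of the L2 variant (the exact normalization InternVL uses may differ, e.g. a learned RMSNorm with per-head scales):

```python
import math


def l2_normalize(vec: list[float], eps: float = 1e-6) -> list[float]:
    """Scale a head-dim vector to unit L2 norm (one flavor of QK-norm)."""
    norm = math.sqrt(sum(x * x for x in vec)) + eps
    return [x / norm for x in vec]


q = [3.0, 4.0]  # toy per-head query vector
k = [0.0, 2.0]  # toy per-head key vector

qn, kn = l2_normalize(q), l2_normalize(k)
# After normalization the attention logit is a cosine similarity,
# bounded in [-1, 1] regardless of the raw vector magnitudes:
logit = sum(a * b for a, b in zip(qn, kn))
print(round(logit, 3))
```

Silently dropping such a step changes the score distribution the checkpoint was trained with, which is why the review flags it as a correctness concern rather than a style issue.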
Merge Readiness
This pull request introduces a significant and largely positive refactoring of the attention mechanism for the InternVL model. However, there is a critical issue regarding the potential omission of QK normalization that must be addressed before merging. Additionally, there are a couple of medium-severity suggestions related to device handling and configuration flexibility that would improve the robustness and maintainability of the code.
I recommend that the critical issue be resolved and the medium-severity suggestions be considered. I am unable to approve this pull request in its current state; please have others review and approve these changes after the necessary modifications are made.
MMMU (batch_size == 1)

/gemini review
Code Review
This pull request adapts the `internvl` model to utilize the new `VisionAttention` layer. The changes involve refactoring the attention mechanism in `vision.py` to support caching of cumulative sequence lengths using a custom `SingletonCache` and `lru_cache`, and updating the `internvl.py` model code to integrate this new attention layer. The weight loading logic in `internvl.py` is adjusted to map checkpoint keys to the new module structure, and a check for unloaded parameters is added for robustness. Overall, the changes appear to correctly integrate the new attention mechanism and improve the reliability of weight loading.
Motivation
Modifications
Checklist