
Conversation


@mickqian mickqian commented Jun 4, 2025

Motivation

Modifications

Checklist


@gemini-code-assist gemini-code-assist bot left a comment


Hello @mickqian, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. This PR focuses on integrating a new VisionAttention layer into the InternVL model within the SGLang framework. The primary goal appears to be replacing the custom FlashAttention implementation previously used in the InternVL vision encoder with a more standardized and potentially more flexible VisionAttention layer. This involves significant changes to the InternVL model's attention mechanism, its forward pass, and how weights are loaded. Additionally, it introduces a new way to handle cumulative sequence lengths (cu_seqlens) using a SingletonCache and a cached utility function, and refactors a simple 'print once' utility.

Highlights

  • Vision Attention Integration: The core change is replacing the existing attention mechanism in the InternVL vision encoder (InternAttention) with a new VisionAttention layer, likely to leverage shared attention logic or optimizations.
  • Cumulative Sequence Length Handling: Introduces a SingletonCache and a cached function (_get_cu_seqlens_for_shape) in vision.py to manage cu_seqlens, and updates the attention forward passes in both vision.py and internvl.py to utilize this new approach.
  • Weight Loading Adaptation: Modifies the load_weights method in internvl.py to correctly map parameter names from the original InternVL checkpoint to the new structure introduced by the VisionAttention layer.
  • Code Refactoring: Removes a custom FlashAttention class and related logic from internvl.py and replaces a custom execute_once/info_once pattern with a standard functools.lru_cache based print_info_once in utils.py.
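The `functools.lru_cache`-based print-once pattern mentioned in the last highlight can be sketched roughly as follows. This is a minimal illustration of the idea, not the exact helper added to sglang's `utils.py`; the real function's logging setup may differ.

```python
import functools
import logging

logger = logging.getLogger(__name__)


@functools.lru_cache(None)
def print_info_once(msg: str) -> None:
    # lru_cache memoizes on the message string, so each distinct
    # message is logged at most once per process; repeated calls
    # with the same message hit the cache and do nothing.
    logger.info(msg)
```

Compared with a hand-rolled `execute_once`/`info_once` pair, this gets deduplication for free from the standard library cache.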

Changelog

Click here to see the changelog
  • python/sglang/srt/layers/attention/vision.py
    • Added imports for dataclasses, functools, Any, Union, and print_info_once (L3-7, L14).
    • Removed wraps import (L6).
    • Removed logger import (L34).
    • Removed custom execute_once and info_once functions (L39-54).
    • Added SingletonCache dataclass for caching data (L41-52).
    • Added cached function _get_cu_seqlens_for_shape to generate cumulative sequence lengths (L56-69).
    • Modified VisionSdpaAttention.forward signature to accept bsz, seq_len, and cu_seqlens which can be SingletonCache or torch.Tensor (L283-285).
    • Added logic in VisionSdpaAttention.forward to calculate cu_seqlens if None or retrieve/set it from SingletonCache (L294-301).
    • Updated QKV splitting logic in VisionAttention.forward to use self.q_size and self.kv_size for splitting (L459).
    • Adjusted the view shape calculation in VisionAttention.forward based on q_size and kv_size (L454).
    • Added seq_len=s argument to the self.qkv_backend call in VisionAttention.forward (L497).
    • Replaced calls to info_once with print_info_once in VisionAttention.__init__ (L374, L378).
  • python/sglang/srt/models/internvl.py
    • Added imports for Set, SingletonCache, and VisionAttention (L14, L27).
    • Removed imports for rearrange and repeat (L22).
    • Removed the FlashAttention class (L43-107).
    • Modified InternAttention.__init__ to accept quant_config (L46).
    • Replaced the self.qkv linear layer and self.inner_attn with a single self.attn instance of VisionAttention (L56-66).
    • Removed the _flash_attn method (L132-152).
    • Modified InternAttention.forward to accept cu_seqlens and pass it to self.attn (L79-81).
    • Modified InternVisionEncoderLayer.forward to accept cu_seqlens (L216).
    • Updated the call to self.attn within InternVisionEncoderLayer.forward to pass cu_seqlens (L229).
    • Added initialization of cu_seqlens = SingletonCache() in InternVisionModel.forward (L298).
    • Updated the call to encoder_layer within InternVisionModel.forward to pass cu_seqlens (L303).
    • Added loaded_params: Set[str] = set() in InternVLChatModel.load_weights (L560).
    • Added logic in InternVLChatModel.load_weights to adapt parameter names for the new attention structure (attn. to attn.attn., qkv. to qkv_proj.) when loading vision model weights (L577-580).
    • Added loaded_params.add(name) after loading each parameter in InternVLChatModel.load_weights (L606).
    • Added a check at the end of InternVLChatModel.load_weights to ensure all parameters were loaded (L607-611).
  • python/sglang/srt/utils.py
    • Added functools import (L20).
    • Added print_info_once utility function using functools.lru_cache(None) (L1367-1369).
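The `cu_seqlens` caching described in the `vision.py` entries above can be sketched as follows. This is an illustrative simplification under stated assumptions: the real `SingletonCache` dataclass and `_get_cu_seqlens_for_shape` in `vision.py` may differ in method names and key on device as well, and the sketch assumes fixed-length vision sequences so that `cu_seqlens` is just an arithmetic progression.

```python
import functools
from dataclasses import dataclass
from typing import Any, Optional

import torch


@dataclass
class SingletonCache:
    """Holds a single cached value (e.g. cu_seqlens shared across layers)."""

    value: Optional[Any] = None

    def set_data(self, value: Any) -> None:
        self.value = value

    def get_data(self) -> Optional[Any]:
        return self.value

    def is_empty(self) -> bool:
        return self.value is None


@functools.lru_cache(maxsize=128)
def get_cu_seqlens_for_shape(bsz: int, seq_len: int) -> torch.Tensor:
    # For fixed-length vision sequences, the cumulative sequence lengths
    # are simply [0, seq_len, 2*seq_len, ..., bsz*seq_len].
    return torch.arange(
        0, (bsz + 1) * seq_len, step=seq_len, dtype=torch.int32
    )
```

The cache object is created once per `InternVisionModel.forward` and threaded through the encoder layers, so the first attention layer computes `cu_seqlens` and every subsequent layer reuses it.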


Vision sees the world,
Attention focuses bright,
New layers connect.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the InternVL model to use a new, more general VisionAttention module. This is a good step towards modularity and potentially allows leveraging different attention backends more easily. The changes include updates to import statements, removal of the older FlashAttention class, and modifications to how cumulative sequence lengths (cu_seqlens) are handled, including a new SingletonCache and _get_cu_seqlens_for_shape utility. The weight loading mechanism in InternVLChatModel has also been improved to adapt to the new attention structure and to ensure all parameters are loaded.

While the refactoring is generally positive, there's a critical concern regarding QK normalization and a few medium-severity points to consider for improvement.

Summary of Findings

  • Missing QK Normalization: The refactored InternAttention module in internvl.py appears to have dropped the QK normalization step that was present in the previous implementation. If this normalization is crucial for the InternVL model, this could lead to correctness or performance issues.
  • Device Handling for cu_seqlens: In VisionFlash3Attention.forward, the cu_seqlens tensor is moved to CUDA using .cuda(). It might be more robust to use .to(q.device) to ensure consistency with the query tensor's device, especially since _get_cu_seqlens_for_shape already places it on q.device.
  • Hardcoded Attention Configuration: The InternAttention module hardcodes qkv_backend="fa3" and use_qkv_parallel=True. Making these configurable could offer greater flexibility for different deployment scenarios or future optimizations.
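The device-handling suggestion above can be illustrated with a minimal sketch (the function name and signature here are hypothetical, not part of the PR):

```python
import torch


def ensure_same_device(cu_seqlens: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # Prefer .to(q.device) over .cuda(): it is a no-op when the tensor is
    # already on the right device, and it stays correct on CPU-only runs
    # or multi-GPU setups where q may not live on the default CUDA device.
    return cu_seqlens.to(q.device)
```

Since `_get_cu_seqlens_for_shape` already places the tensor on `q.device`, the `.to(q.device)` call is usually free and only moves data when the cached tensor genuinely lives elsewhere.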

Merge Readiness

This pull request introduces a significant and largely positive refactoring of the attention mechanism for the InternVL model. However, there is a critical issue regarding the potential omission of QK normalization that must be addressed before merging. Additionally, there are a couple of medium-severity suggestions related to device handling and configuration flexibility that would improve the robustness and maintainability of the code.

I recommend that the critical issue be resolved and the medium-severity suggestions be considered. I am unable to approve this pull request in its current state; please have others review and approve these changes after the necessary modifications are made.


mickqian commented Jun 4, 2025

MMMU (batch_size == 1):

|        | accuracy | time (s) |
|--------|----------|----------|
| before | 0.528    | 252.0    |
| PR     | 0.528    | 245.1    |

@mickqian mickqian marked this pull request as ready for review June 4, 2025 16:20
@mickqian

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adapts the internvl model to utilize the new VisionAttention layer. The changes involve refactoring the attention mechanism in vision.py to support caching of cumulative sequence lengths using a custom SingletonCache and lru_cache, and updating the internvl.py model code to integrate this new attention layer. The weight loading logic in internvl.py is adjusted to map checkpoint keys to the new module structure, and a check for unloaded parameters is added for robustness. Overall, the changes appear to correctly integrate the new attention mechanism and improve the reliability of weight loading.
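The checkpoint-key remapping mentioned here (`attn.` → `attn.attn.`, `qkv.` → `qkv_proj.`, per the changelog) can be sketched as follows. This is a simplified, hypothetical helper rather than the actual `load_weights` code; the real implementation inlines the renaming and may apply it under different conditions.

```python
def remap_vision_weight_name(name: str) -> str:
    # Sketch of the key renaming: original InternVL checkpoints store the
    # vision attention weights under "attn." with a fused "qkv." projection;
    # the refactored model nests VisionAttention as "attn.attn." and names
    # the fused projection "qkv_proj.".
    if "vision_model" in name:
        name = name.replace("attn.qkv.", "attn.qkv_proj.")
        name = name.replace("attn.", "attn.attn.", 1)
    return name
```

Language-model weights pass through unchanged; only vision-encoder keys are rewritten before being matched against the new module tree, and each successfully loaded name is recorded in `loaded_params` so missing parameters can be reported at the end.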

@zhyncs zhyncs merged commit 83d8768 into sgl-project:main Jun 11, 2025
88 of 99 checks passed
jianan-gu pushed a commit to jianan-gu/sglang that referenced this pull request Jun 12, 2025