
Conversation


@mickqian mickqian commented Jun 4, 2025

Motivation

Modifications

Checklist


@gemini-code-assist gemini-code-assist bot left a comment


Hello @mickqian, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. This PR focuses on integrating a new VisionAttention layer into the InternVL model within the SGLang framework. The primary goal appears to be replacing the custom FlashAttention implementation previously used in the InternVL vision encoder with a more standardized and potentially more flexible VisionAttention layer. This involves significant changes to the InternVL model's attention mechanism, its forward pass, and how weights are loaded. Additionally, it introduces a new way to handle cumulative sequence lengths (cu_seqlens) using a SingletonCache and a cached utility function, and refactors a simple 'print once' utility.

Highlights

  • Vision Attention Integration: The core change is replacing the existing attention mechanism in the InternVL vision encoder (InternAttention) with a new VisionAttention layer, likely to leverage shared attention logic or optimizations.
  • Cumulative Sequence Length Handling: Introduces a SingletonCache and a cached function (_get_cu_seqlens_for_shape) in vision.py to manage cu_seqlens, and updates the attention forward passes in both vision.py and internvl.py to utilize this new approach.
  • Weight Loading Adaptation: Modifies the load_weights method in internvl.py to correctly map parameter names from the original InternVL checkpoint to the new structure introduced by the VisionAttention layer.
  • Code Refactoring: Removes a custom FlashAttention class and related logic from internvl.py and replaces a custom execute_once/info_once pattern with a standard functools.lru_cache based print_info_once in utils.py.
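The `functools.lru_cache`-based print-once pattern mentioned in the last highlight can be sketched roughly as follows. This is a minimal illustration of the idea, not the exact helper added to sglang's `utils.py`; the real function's logging setup may differ.

```python
import functools
import logging

logger = logging.getLogger(__name__)


@functools.lru_cache(None)
def print_info_once(msg: str) -> None:
    # lru_cache memoizes on the message string, so each distinct
    # message is logged at most once per process; repeated calls
    # with the same message hit the cache and do nothing.
    logger.info(msg)
```

Compared with a hand-rolled `execute_once`/`info_once` pair, this gets deduplication for free from the standard library cache.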

Changelog

Click here to see the changelog
  • python/sglang/srt/layers/attention/vision.py
    • Added imports for dataclasses, functools, Any, Union, and print_info_once (L3-7, L14).
    • Removed wraps import (L6).
    • Removed logger import (L34).
    • Removed custom execute_once and info_once functions (L39-54).
    • Added SingletonCache dataclass for caching data (L41-52).
    • Added cached function _get_cu_seqlens_for_shape to generate cumulative sequence lengths (L56-69).
    • Modified VisionSdpaAttention.forward signature to accept bsz, seq_len, and cu_seqlens which can be SingletonCache or torch.Tensor (L283-285).
    • Added logic in VisionSdpaAttention.forward to calculate cu_seqlens if None or retrieve/set it from SingletonCache (L294-301).
    • Updated QKV splitting logic in VisionAttention.forward to use self.q_size and self.kv_size for splitting (L459).
    • Adjusted the view shape calculation in VisionAttention.forward based on q_size and kv_size (L454).
    • Added seq_len=s argument to the self.qkv_backend call in VisionAttention.forward (L497).
    • Replaced calls to info_once with print_info_once in VisionAttention.__init__ (L374, L378).
  • python/sglang/srt/models/internvl.py
    • Added imports for Set, SingletonCache, and VisionAttention (L14, L27).
    • Removed imports for rearrange and repeat (L22).
    • Removed the FlashAttention class (L43-107).
    • Modified InternAttention.__init__ to accept quant_config (L46).
    • Replaced the self.qkv linear layer and self.inner_attn with a single self.attn instance of VisionAttention (L56-66).
    • Removed the _flash_attn method (L132-152).
    • Modified InternAttention.forward to accept cu_seqlens and pass it to self.attn (L79-81).
    • Modified InternVisionEncoderLayer.forward to accept cu_seqlens (L216).
    • Updated the call to self.attn within InternVisionEncoderLayer.forward to pass cu_seqlens (L229).
    • Added initialization of cu_seqlens = SingletonCache() in InternVisionModel.forward (L298).
    • Updated the call to encoder_layer within InternVisionModel.forward to pass cu_seqlens (L303).
    • Added loaded_params: Set[str] = set() in InternVLChatModel.load_weights (L560).
    • Added logic in InternVLChatModel.load_weights to adapt parameter names for the new attention structure (attn. to attn.attn., qkv. to qkv_proj.) when loading vision model weights (L577-580).
    • Added loaded_params.add(name) after loading each parameter in InternVLChatModel.load_weights (L606).
    • Added a check at the end of InternVLChatModel.load_weights to ensure all parameters were loaded (L607-611).
  • python/sglang/srt/utils.py
    • Added functools import (L20).
    • Added print_info_once utility function using functools.lru_cache(None) (L1367-1369).
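The `cu_seqlens` caching described in the `vision.py` entries above can be sketched as follows. This is an illustrative simplification under stated assumptions: the real `SingletonCache` dataclass and `_get_cu_seqlens_for_shape` in `vision.py` may differ in method names and key on device as well, and the sketch assumes fixed-length vision sequences so that `cu_seqlens` is just an arithmetic progression.

```python
import functools
from dataclasses import dataclass
from typing import Any, Optional

import torch


@dataclass
class SingletonCache:
    """Holds a single cached value (e.g. cu_seqlens shared across layers)."""

    value: Optional[Any] = None

    def set_data(self, value: Any) -> None:
        self.value = value

    def get_data(self) -> Optional[Any]:
        return self.value

    def is_empty(self) -> bool:
        return self.value is None


@functools.lru_cache(maxsize=128)
def get_cu_seqlens_for_shape(bsz: int, seq_len: int) -> torch.Tensor:
    # For fixed-length vision sequences, the cumulative sequence lengths
    # are simply [0, seq_len, 2*seq_len, ..., bsz*seq_len].
    return torch.arange(
        0, (bsz + 1) * seq_len, step=seq_len, dtype=torch.int32
    )
```

The cache object is created once per `InternVisionModel.forward` and threaded through the encoder layers, so the first attention layer computes `cu_seqlens` and every subsequent layer reuses it.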


Vision sees the world,
Attention focuses bright,
New layers connect.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the InternVL model to use a new, more general VisionAttention module. This is a good step towards modularity and potentially allows leveraging different attention backends more easily. The changes include updates to import statements, removal of the older FlashAttention class, and modifications to how cumulative sequence lengths (cu_seqlens) are handled, including a new SingletonCache and _get_cu_seqlens_for_shape utility. The weight loading mechanism in InternVLChatModel has also been improved to adapt to the new attention structure and to ensure all parameters are loaded.

While the refactoring is generally positive, there's a critical concern regarding QK normalization and a few medium-severity points to consider for improvement.

Summary of Findings

  • Missing QK Normalization: The refactored InternAttention module in internvl.py appears to have dropped the QK normalization step that was present in the previous implementation. If this normalization is crucial for the InternVL model, this could lead to correctness or performance issues.
  • Device Handling for cu_seqlens: In VisionFlash3Attention.forward, the cu_seqlens tensor is moved to CUDA using .cuda(). It might be more robust to use .to(q.device) to ensure consistency with the query tensor's device, especially since _get_cu_seqlens_for_shape already places it on q.device.
  • Hardcoded Attention Configuration: The InternAttention module hardcodes qkv_backend="fa3" and use_qkv_parallel=True. Making these configurable could offer greater flexibility for different deployment scenarios or future optimizations.
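The device-handling suggestion above can be illustrated with a minimal sketch (the function name and signature here are hypothetical, not part of the PR):

```python
import torch


def ensure_same_device(cu_seqlens: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # Prefer .to(q.device) over .cuda(): it is a no-op when the tensor is
    # already on the right device, and it stays correct on CPU-only runs
    # or multi-GPU setups where q may not live on the default CUDA device.
    return cu_seqlens.to(q.device)
```

Since `_get_cu_seqlens_for_shape` already places the tensor on `q.device`, the `.to(q.device)` call is usually free and only moves data when the cached tensor genuinely lives elsewhere.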

Merge Readiness

This pull request introduces a significant and largely positive refactoring of the attention mechanism for the InternVL model. However, there is a critical issue regarding the potential omission of QK normalization that must be addressed before merging. Additionally, there are a couple of medium-severity suggestions related to device handling and configuration flexibility that would improve the robustness and maintainability of the code.

I recommend that the critical issue be resolved and the medium-severity suggestions be considered. I am unable to approve this pull request in its current state; please have others review and approve these changes after the necessary modifications are made.


mickqian commented Jun 4, 2025

MMMU (batch_size == 1):

|        | accuracy | time (s) |
|--------|----------|----------|
| before | 0.528    | 252.0    |
| PR     | 0.528    | 245.1    |

@mickqian mickqian marked this pull request as ready for review June 4, 2025 16:20
@mickqian

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adapts the internvl model to utilize the new VisionAttention layer. The changes involve refactoring the attention mechanism in vision.py to support caching of cumulative sequence lengths using a custom SingletonCache and lru_cache, and updating the internvl.py model code to integrate this new attention layer. The weight loading logic in internvl.py is adjusted to map checkpoint keys to the new module structure, and a check for unloaded parameters is added for robustness. Overall, the changes appear to correctly integrate the new attention mechanism and improve the reliability of weight loading.
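The checkpoint-key remapping mentioned here (`attn.` → `attn.attn.`, `qkv.` → `qkv_proj.`, per the changelog) can be sketched as follows. This is a simplified, hypothetical helper rather than the actual `load_weights` code; the real implementation inlines the renaming and may apply it under different conditions.

```python
def remap_vision_weight_name(name: str) -> str:
    # Sketch of the key renaming: original InternVL checkpoints store the
    # vision attention weights under "attn." with a fused "qkv." projection;
    # the refactored model nests VisionAttention as "attn.attn." and names
    # the fused projection "qkv_proj.".
    if "vision_model" in name:
        name = name.replace("attn.qkv.", "attn.qkv_proj.")
        name = name.replace("attn.", "attn.attn.", 1)
    return name
```

Language-model weights pass through unchanged; only vision-encoder keys are rewritten before being matched against the new module tree, and each successfully loaded name is recorded in `loaded_params` so missing parameters can be reported at the end.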

@zhyncs zhyncs merged commit 83d8768 into sgl-project:main Jun 11, 2025
88 of 99 checks passed
jianan-gu pushed a commit to jianan-gu/sglang that referenced this pull request Jun 12, 2025