Conversation

@mickqian (Collaborator) commented Feb 6, 2025

Motivation

Support InternVL2_5, as requested in #3092

Modifications

  1. InternVLChatModel

Checklist

  • Format your code according to Code Formatting with Pre-Commit.
  • Add unit tests as outlined in Running Unit Tests.
  • Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.

@mickqian (Collaborator, Author) commented Feb 6, 2025

depends on #3203

@mickqian mickqian force-pushed the internVL branch 2 times, most recently from 6cf5f01 to 506dda1 Compare February 7, 2025 13:48
@mickqian mickqian marked this pull request as ready for review February 9, 2025 08:24
@mickqian mickqian changed the title from "feature: Intern vl 2.5" to "model: Intern vl 2.5" Feb 20, 2025

all_frames = []

def load_image_internvl(image_file, input_size=448, max_num=12):

Contributor:
Do we need to introduce parameters for the max/min number of patches passed via the API, similar to what is done in lmdeploy?

internvl.md

    dict(type='image_url', image_url=dict(max_dynamic_patch=12, url='https://raw.githubusercontent.com/OpenGVLab/InternVL/main/internvl_chat/examples/image1.jpg')),


Collaborator (Author):
Is that parameter frequently used? If not, and if other VL models don't support it, this might be low priority.


Contributor:
Perhaps the user wants to control the limit on the size of image slices per request?
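For reference, a minimal sketch of how such a per-request limit could be read from the request payload. The `max_dynamic_patch` field mirrors lmdeploy's format; the helper name and the clamping policy are hypothetical, not part of this PR:

```python
def resolve_max_patches(image_url: dict, default: int = 12) -> int:
    """Read an optional per-request patch limit from the image_url payload.

    Falls back to the server default when the field is absent, and clamps
    the value so a single request cannot inflate the tile count.
    """
    value = image_url.get("max_dynamic_patch", default)
    return max(1, min(int(value), default))


# Example request fragment, mirroring lmdeploy's message format:
content_item = {
    "type": "image_url",
    "image_url": {"max_dynamic_patch": 6, "url": "https://example.com/image.jpg"},
}
print(resolve_max_patches(content_item["image_url"]))  # 6
```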

self.vision_model = InternVisionModel(
config=config.vision_config, quant_config=quant_config
)
self.language_model = InternLM2ForCausalLM(

Contributor:
Add support for Qwen/Llama/... as the language_model.

@Titan-p (Contributor) commented Feb 21, 2025

I believe I have identified the issue: it's related to the order of the image token and the user input.
According to the InternVL2_5 example,

question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

the <image> token should come before the user input.

diff --git a/python/sglang/srt/conversation.py b/python/sglang/srt/conversation.py
index 65eefca3..cae45883 100644
--- a/python/sglang/srt/conversation.py
+++ b/python/sglang/srt/conversation.py
@@ -440,7 +440,10 @@ def generate_chat_conv(
                         real_content += content.text
                     elif content.type == "image_url":
                         # NOTE: Only works for llava
-                        real_content += image_token
+                        if conv.name == "internvl2_5":
+                            real_content = image_token + real_content
+                        else:
+                            real_content += image_token
                         conv.append_image(content.image_url.url)
                 conv.append_message(conv.roles[0], real_content)
         elif msg_role == "assistant":

This modification will work.
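The ordering rule in the diff can be illustrated standalone (a sketch; the function name and `conv_name` parameter are illustrative stand-ins, not the actual SGLang API):

```python
IMAGE_TOKEN = "<image>"


def place_image_token(real_content: str, conv_name: str) -> str:
    """Prepend the image token for InternVL2.5; append it for other templates.

    InternVL2.5's reference example expects the image token, then a newline,
    then the question, so for that template the token must come before the
    user text rather than after it.
    """
    if conv_name == "internvl2_5":
        return IMAGE_TOKEN + real_content
    return real_content + IMAGE_TOKEN


print(place_image_token("Please describe the image shortly.", "internvl2_5"))
# <image>Please describe the image shortly.
```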


It appears there are some discrepancies with the inference results from lmdeploy: SGLang is unable to generate Chinese output, particularly in multi-image scenarios.
Request:

{
    "model": "/home/base-data/largeModel/internalAccess/InternVL2_5-2B",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "中文描述两张图片内容"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "/home/panlyu/images/004.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "/home/panlyu/images/005.jpg"
                    }
                }
            ]
        }
    ]
}

Result

  • SGLang

    • "message": { "role": "assistant", "content": "The image on the left is a meme featuring a character from an animated film holding a dog, while the image on the right is an image of a dog lying down. The text in the meme on the left is \"JUST..MONDAY.\" The image on the right is captioned with the text \"MONDAY.\"", "tool_calls": null },
  • lmdeploy 0.7.0.post3

    • "message": { "role": "assistant", "content": "这两张图片是一张恶搞图,将动画角色柴犬(Shiba Inu)与现实生活中的著名科技公司创始人之一埃隆·马斯克的形象结合在一起。\n\n第一张图片上,柴犬被描绘成举着一只小柴犬,仿佛在庆祝某个事件。背景是晴朗的天空,给人一种轻松愉快的感觉。\n\n第二张图片上,一只法国斗牛犬懒散地躺在地板上,似乎在打盹,背景也是一片蓝色的地板。图片上方和下方分别有文字:“MONDAY. JUST..MONDAY.”(星期一,只是...星期一。)\n\n这两张图片通过将柴犬与马斯克的形象结合,形成了一种幽默的对比。柴犬通常形象活泼可爱,而法国斗牛犬则通常被描绘成慵懒、放荡不羁。马斯克的形象则以严肃和认真著称,这使得这张恶搞图显得非常滑稽,也传达了一种对生活的调侃。", "tool_calls": null },
  • Images used: 004.jpg and 005.jpg (referenced in the request above)

@mickqian mickqian force-pushed the internVL branch 2 times, most recently from f4de116 to a7734d7 Compare February 21, 2025 05:12
@zhaochenyang20 (Collaborator):

This LGTM, but it's too large; I think decoupling it into smaller PRs would be good.

@mickqian mickqian marked this pull request as draft March 9, 2025 09:51
@Fxycst1213 Fxycst1213 mentioned this pull request Mar 13, 2025
@hehesangsj hehesangsj mentioned this pull request Mar 14, 2025
@mickqian mickqian closed this Mar 21, 2025
@zhaochenyang20 (Collaborator):

@mickqian Why did we close this? Is there any follow-up?

@xiaomin-D xiaomin-D mentioned this pull request Apr 13, 2025