
Conversation

@FrankLeeeee (Collaborator) commented on Feb 13, 2025

Motivation

This PR aims to fix issue #3545 by enhancing the engine with support for vision-language models, such as Qwen2-VL, for offline inference.

Modifications

First of all, it should be noted that the current code design has issues that make this PR an imperfect solution. For that reason, I have not added documentation or unit tests yet, and I would like to discuss how to improve the code as a whole.

The root causes of #3545 are:

  1. VLMs require the prompt to be pre-processed by the chat template, because the processor inserts image-related tokens into the prompt. For example, Qwen2-VL adds the following tokens: <|vision_start|><|image_pad|><|vision_end|>. VLMs therefore differ from LLMs in that the chat template is mandatory for a VLM, whereas an LLM can still generate sensible output without it. (A snippet demonstrating this follows the list.)
  2. sgl.Engine is not responsible for applying the chat template to the prompts.
  3. The current API for applying the chat template, v1_chat_generate_request, is designed only for online serving.
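
To make root cause 1 concrete, the snippet below, a hypothetical illustration using the Hugging Face AutoProcessor, prints the prompt that Qwen2-VL's chat template renders. The exact string depends on the model's template, but it contains the vision tokens mentioned above.

```python
# Hypothetical illustration of root cause 1: the processor's chat template
# inserts image-related tokens that a raw text prompt would not contain.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
rendered = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(rendered)
# The rendered prompt includes <|vision_start|><|image_pad|><|vision_end|>
# at the position where the image will be placed.
```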

As a result, the currently proposed workflow for running VLMs offline is shown in the diagram below:
[Figure: proposed offline VLM inference workflow]

This is not very elegant, because it counter-intuitively uses an API designed for online serving in an offline inference scenario. I would suggest that we either extract the preprocessing logic into independent APIs or create a new API for preprocessing in offline cases.
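
For concreteness, here is a minimal end-to-end sketch of the workflow described above. It is a hypothetical illustration, not the exact code path in this PR: it applies the chat template directly via the Hugging Face processor rather than through v1_chat_generate_request, and the Engine.generate argument names (prompt, image_data) follow the current sglang API but may differ across versions.

```python
# Minimal sketch of offline VLM inference with the chat template applied
# up front. Hypothetical example; the merged PR routes the template step
# through v1_chat_generate_request instead.
import sglang as sgl
from transformers import AutoProcessor

MODEL = "Qwen/Qwen2-VL-7B-Instruct"

# Step 1: render the prompt with the chat template, inserting the
# vision tokens that sgl.Engine itself will not add.
processor = AutoProcessor.from_pretrained(MODEL)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Step 2: pass the pre-processed prompt together with the image to the
# offline engine; image_data may be a local path or URL.
engine = sgl.Engine(model_path=MODEL)
output = engine.generate(prompt=prompt, image_data="example.jpg")
print(output)
```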

Open for discussion.


@yizhang2077 (Collaborator) commented:
LGTM, cc @zhaochenyang20 @merrymercy

@zhyncs merged commit fb4c9c3 into sgl-project:main on Feb 14, 2025.
@FrankLeeeee deleted the hotfix/offline-vlm branch on Feb 15, 2025.