Conversation

kcz358
Contributor

@kcz358 kcz358 commented Aug 16, 2024

This is a work-in-progress PR. I'm putting it here so the changes are visible, and I welcome any suggestions.

Motivation

With this PR, we wish to add support for LLaVA-OneVision, a new model that accepts multiple input types: single image (with anyres_max_N), multi-image, and video.

Modification

  1. add support for SigLIP.
  2. add support for Qwen2 (nothing changes from the previous llava-next).
  3. handle single-image input with the anyres_max_N strategy (at most 729 * (N + 1) tokens), multi-image input with the base token strategy (729 tokens per image), and video input with the bilinear-interpolation token strategy (196 tokens per frame); see the sketch after this list.
  4. add an OpenAI-compatible server for launching a backend service for the llava-onevision model. For all of the input types mentioned above, we use the same interface as OpenAI's GPT input format (image and text can also be interleaved).
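
To make item 3 concrete, here is a minimal sketch of the three token budgets. The helper names and tensor shapes are illustrative assumptions, not the actual sglang implementation; the only facts taken from this PR are the 729-token base grid, the 729 * (N + 1) anyres cap, and the 196 tokens per video frame.

```python
import torch
import torch.nn.functional as F

BASE_TOKENS = 729  # one 27 x 27 SigLIP feature grid per image

def single_image_budget(anyres_max_n: int) -> int:
    # anyres_max_N: base view plus up to N high-resolution crops, 729 tokens each
    return BASE_TOKENS * (anyres_max_n + 1)

def multi_image_budget(num_images: int) -> int:
    # multi-image: base strategy only, 729 tokens per image
    return BASE_TOKENS * num_images

def video_frame_tokens(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: (num_frames, 729, hidden); bilinearly downsample the
    # 27 x 27 grid to 14 x 14 = 196 tokens per frame
    n, _, d = frame_features.shape
    grid = frame_features.view(n, 27, 27, d).permute(0, 3, 1, 2)    # (n, d, 27, 27)
    grid = F.interpolate(grid, size=(14, 14), mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(n, 14 * 14, d)          # (n, 196, d)

print(single_image_budget(9))  # anyres_max_9 -> at most 7290 visual tokens
```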

Checklist

  1. Ensure `pre-commit run --all-files` or other linting tools are used to fix potential lint issues.
  2. Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  3. Modify documentation as needed, such as docstrings or example tutorials.

@Luodian
Contributor

Luodian commented Aug 16, 2024

Sorry for sending the PR a second time. We found it hard to rebase our previous implementation, which incorporated the SigLIP and multi-image features, onto sglang after v0.2: it would hang for no apparent reason after prefill and decode, and we were unable to find the cause after a few days of trying.

So we checked out the latest main and re-implemented the features. We should still acknowledge the contributions to the original sglang implementation: @kcz358 for adding the SigLIP feature (and for continuously debugging/maintaining this PR), Peiyuan Zhang for adding the multi-image feature, and @Luodian for debugging the token strategies for single-image/multi-image/video in llava-onevision.

@Luodian
Contributor

Luodian commented Aug 16, 2024

SGLang Performance (Image)

on AI2D, 10-20 input text tokens + 6-8k input visual tokens, 2-3 output tokens.

| Model | LLaVA-OneVision-0.5B | LLaVA-OneVision-7B | LLaVA-OneVision-72B |
| --- | --- | --- | --- |
| LMMs-Eval LLaVA_Onevision | ~5s/it | ~3s/it | ~3s/it |
| LMMs-Eval SRT-API (SGLang) | ~5s/it | ~5s/it | ~1.5s/it |

BS=1, TP=8

SGLang Performance (Video)

on VideoMME, 30-50 input text tokens + 32 * 196 input visual tokens, 2-3 output tokens.

| Model | LLaVA-OneVision-0.5B | LLaVA-OneVision-7B | LLaVA-OneVision-72B |
| --- | --- | --- | --- |
| LMMs-Eval LLaVA_Onevision | ~4s/it | ~5s/it | ~8.5s/it |
| LMMs-Eval SRT-API (SGLang) | ~3.5s/it | ~4s/it | ~6s/it |

BS=1, TP=8

Note: the number of attention heads in the 0.5B and 7B models is not divisible by 8, so they are served on a single GPU only.

@Luodian
Contributor

Luodian commented Aug 16, 2024

Results

AI2D

LLaVA-OV-0.5B (original): 57.1%
LLaVA-OV-0.5B (srt-api, sglang): 56.6%

VideoMME (32 frames)

LLaVA-OV-0.5B (original): 44.0%
LLaVA-OV-0.5B (srt-api, sglang): 43.5%

@kcz358
Contributor Author

kcz358 commented Aug 17, 2024

More results

MME

LLaVA-OV-0.5B (original): 240/1238
LLaVA-OV-0.5B (srt-api, sglang): 240/1232

Notice: Due to the anyres_max_9 strategy, we might need to set --chunked-prefill-size 16384 when setting up the server; otherwise the token count might exceed the default of 8192.
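
For example (illustrative only; this mirrors the launch command used later in this thread, with the flag from the note above added):

python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --tokenizer-path lmms-lab/llavanext-qwen-siglip-tokenizer --port=30000 --tp-size=8 --chat-template=chatml-llava --chunked-prefill-size 16384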

@Ying1123 Ying1123 mentioned this pull request Aug 17, 2024
@Luodian
Contributor

Luodian commented Aug 18, 2024

Test in More Output Tokens Scenario

I tested lmms-lab/llava-onevision-qwen2-72b-ov with:

python ./test/srt/test_llava_onevision_openai_server.py

Single Image

Prompt Tokens: 7186 (7150 visual + 36 text)
Completion Tokens: 487
Completion Tokens/Sec: 45.02

Video

Prompt Tokens: 6298 (6272 visual, i.e. 32 frames x 196 tokens per frame, + 26 text)
Completion Tokens: 224
Completion Tokens/Sec: 26.41

But during completion, I can see the log shows gen throughput (token/s): 50-51 in both the image and the video case.
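
For reference, a request in that OpenAI-compatible format might look like the following. This is a hypothetical sketch, not the contents of test_llava_onevision_openai_server.py; the base_url, model name, and image URL are assumptions, and only the interleaved image/text message format follows what this PR describes.

```python
import openai

# Point the standard OpenAI client at the locally launched sglang server.
client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # assumed model name; depends on how the server is launched
    messages=[
        {
            "role": "user",
            "content": [
                # Image and text parts can be interleaved in a single message.
                {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
    temperature=0,
    max_tokens=512,
)
print(response.choices[0].message.content)
```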

@Luodian
Contributor

Luodian commented Aug 20, 2024

SGLang Performance (Image)

on AI2D, 10-20 input text tokens + 6-8k input visual tokens, 2-3 output tokens.

| Model | LLaVA-OneVision-0.5B | LLaVA-OneVision-7B | LLaVA-OneVision-72B |
| --- | --- | --- | --- |
| LMMs-Eval LLaVA_Onevision | ~5s/it | ~3s/it | ~3s/it |
| LMMs-Eval SRT-API (SGLang) | ~5s/it | ~5s/it | ~1.5s/it |

BS=1, TP=8

SGLang Performance (Video)

on VideoMME, 30-50 input text tokens + 32 * 196 input visual tokens, 2-3 output tokens.

| Model | LLaVA-OneVision-0.5B | LLaVA-OneVision-7B | LLaVA-OneVision-72B |
| --- | --- | --- | --- |
| LMMs-Eval LLaVA_Onevision | ~4s/it | ~5s/it | ~8.5s/it |
| LMMs-Eval SRT-API (SGLang) | ~3.5s/it | ~4s/it | ~6s/it |

BS=1, TP=8

Note: the number of attention heads in the 0.5B and 7B models is not divisible by 8, so they are served on a single GPU only.

Note: I think this is not the best way to demonstrate the acceleration of sglang since both benchmarks only need the model to decode a few tokens. I will try llava_bench to better show the difference.

README.md Outdated
@@ -227,8 +227,13 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- `python -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
- `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
- `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 30000`
- `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host=127.0.0.1 --tp-size=1 --chat-template=llava_llama_3`
- `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=8`
Contributor

Is it possible to move the tokenizer files under the model repo? That way we would not need to specify --tokenizer-path separately.

Contributor

@kcz358 We solved this in llava-llama3-8b; can you try whether we can do the same for llava-onevision? I think we can ignore llava-next-72b.

Contributor Author

Hi, I have merged the tokenizers for all the models (llava-onevision, llava-next-72b, and llava-llama3-8b) and revised the commands. Here are the test results:

llava-llama3-8B

[screenshot]

llava-next-72B

[screenshot]

llava-onevision-72B

[screenshot]
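
Based on the merged tokenizers, the revised launch command presumably no longer needs --tokenizer-path, e.g. (illustrative sketch, not necessarily the exact README wording):

python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava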

@Luodian
Contributor

Luodian commented Aug 21, 2024

Here are the results in scenarios with more output tokens. We can see the speed-up now.

  1. videodc_499, ~6300 input tokens / ~150 output tokens

     | Model | Inference Speed |
     | --- | --- |
     | LMMs-Eval LLaVA_Onevision | ~46.7s/it |
     | LMMs-Eval SRT_API (SGLang) | ~18.3s/it |

  2. llava_in_the_wild, ~8000 input tokens / ~100 output tokens

     | Model | Inference Speed |
     | --- | --- |
     | LMMs-Eval LLaVA_Onevision | ~23.7s/it |
     | LMMs-Eval SRT_API (SGLang) | ~5.3s/it |

The evaluation is conducted with lmms-eval; we observe the average time per request (s/it).
The baseline is llava_onevision:

python3 -m lmms_eval --model=llava_onevision \
    --model_args=pretrained=lmms-lab/llava-onevision-qwen2-72b-ov,conv_template=qwen_1_5,device_map=auto,model_name=llava_qwen \
    --tasks=$TASKS \
    --batch_size=1 \
    --log_samples \
    --log_samples_suffix=${BASE_NAME} \
    --output_path="./logs/"

The SGLang backend is srt_api:

python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --tokenizer-path lmms-lab/llavanext-qwen-siglip-tokenizer --port=30000 --host=127.0.0.1 --tp-size=8 --chat-template=chatml-llava;

python3 -m lmms_eval \
    --model srt_api \
    --model_args modality=$MODALITY,host=127.0.0.1,port=30000,timeout=600,max_frames_num=$MAX_FRAMES_NUM \
    --tasks $TASK \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix $TASK_SUFFIX \
    --output_path ./logs/

kcz358 and others added 15 commits August 22, 2024 10:27
✨ feat(test_multi_image_openai_server.py): add functionality to download jobs.mp4
🔧 fix(test_multi_image_openai_server.py): update video path to use cached location
This commit removes the video file that is no longer needed in the project.
Cleaning up unused assets helps to reduce clutter and improve project maintainability.
🔧 fix(test_multi_image_openai_server.py): update video URL to match new location

This commit introduces a new feature in `srt_example_llava_v.py` that
downloads a video file from a specified URL and saves it to the user's
cache directory. This change enhances usability by ensuring that the
required video file is readily available for processing.

Additionally, the test file `test_multi_image_openai_server.py` is
updated to reflect the new URL for the video file, ensuring that tests
point to the correct resource location. This helps maintain the integrity
of the tests and ensures they run successfully with the updated video
source.
This change adds a try-except block around the code that retrieves the
index of `image_token_index` in `input_ids`. If the index is not found,
it defaults to 0 instead of raising an error. This improves the robustness
of the code by ensuring that it can handle cases where the image token
is not present in the input, preventing potential crashes during execution.
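
For illustration, the change described in the commit message above probably has roughly this shape (a minimal sketch; the variable names follow the commit message, not necessarily the actual code):

```python
try:
    offset = input_ids.index(image_token_index)
except ValueError:
    # The image token may be absent (e.g. a text-only request);
    # fall back to 0 instead of raising an error.
    offset = 0
```
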
This commit adds new examples for launching the server with
additional model paths, tokenizer paths, and parameters such as
host and tp-size. These updates aim to provide clearer guidance
for users on how to utilize the latest models and configurations,
enhancing the overall documentation and user experience.
This commit refactors the image processing logic in the TokenizerManager class. The changes include:
- Adding a new method `_process_single_image` to handle single image processing
- Updating the `get_pixel_values` method to use the `_process_single_image` method for single image data
- Simplifying the conditional statements for handling image data types

These changes aim to improve the readability and maintainability of the code related to image processing in the TokenizerManager class.
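
A minimal sketch of the refactor described in the last commit message above (the method names `_process_single_image` and `get_pixel_values` come from the commit message; everything else, including the signatures, is an assumption):

```python
from typing import List, Union

class TokenizerManager:
    def _process_single_image(self, image_data):
        # Decode one image (URL, path, or raw bytes) and run the image
        # processor to obtain its pixel values.
        ...

    def get_pixel_values(self, image_data: Union[str, bytes, List]):
        # A list means multi-image (or video-frame) input: process each entry.
        if isinstance(image_data, list):
            return [self._process_single_image(img) for img in image_data]
        # Otherwise treat it as a single image.
        return self._process_single_image(image_data)
```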
@kcz358 kcz358 force-pushed the dev/onevision_main branch from 777da44 to 021e0d8 on August 22, 2024 10:35
@Luodian
Contributor

Luodian commented Aug 23, 2024

@merrymercy Can we get this PR merged now?

@merrymercy merrymercy merged commit a5b14ad into sgl-project:main Aug 23, 2024
5 checks passed
@merrymercy
Contributor

@Luodian @kcz358 Great work! It is merged now

@Ying1123 Ying1123 changed the title [Feat/WIP] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. Aug 24, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
…(2) qwen2 decoder (3) openai api compatible server. (sgl-project#1123)

Co-authored-by: Bo Li <drluodian@gmail.com>