Conversation

kcz358
Contributor

@kcz358 kcz358 commented Aug 16, 2024

This is a work-in-progress PR. I'm putting it here so the changes are visible, and I welcome any suggestions.

Motivation

With this PR, we wish to add support for LLaVA-OneVision, a new model that accepts multiple input types: single image (with anyres_max_N), multi-image, and video.

Modification

  1. add support for SigLIP.
  2. add support for Qwen2 (nothing changes from the previous llava-next).
  3. handle single-image input with the anyres_max_N strategy (at most 729 * (N + 1) tokens), multi-image input with the base token strategy (729 tokens per image), and video input with the bilinear-interpolation token strategy (196 tokens per frame); see the sketch after this list.
  4. add an OpenAI-compatible server for launching a backend service for the llava-onevision model. For all of the input types mentioned above, we use the same interface as OpenAI's GPT input format (image and text can also be interleaved).
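
To make item 3 concrete, here is a minimal sketch of the three token budgets. The helper names and tensor shapes are illustrative assumptions, not the actual sglang implementation; the only facts taken from this PR are the 729-token base grid, the 729 * (N + 1) anyres cap, and the 196 tokens per video frame.

```python
import torch
import torch.nn.functional as F

BASE_TOKENS = 729  # one 27 x 27 SigLIP feature grid per image

def single_image_budget(anyres_max_n: int) -> int:
    # anyres_max_N: base view plus up to N high-resolution crops, 729 tokens each
    return BASE_TOKENS * (anyres_max_n + 1)

def multi_image_budget(num_images: int) -> int:
    # multi-image: base strategy only, 729 tokens per image
    return BASE_TOKENS * num_images

def video_frame_tokens(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: (num_frames, 729, hidden); bilinearly downsample the
    # 27 x 27 grid to 14 x 14 = 196 tokens per frame
    n, _, d = frame_features.shape
    grid = frame_features.view(n, 27, 27, d).permute(0, 3, 1, 2)    # (n, d, 27, 27)
    grid = F.interpolate(grid, size=(14, 14), mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(n, 14 * 14, d)          # (n, 196, d)

print(single_image_budget(9))  # anyres_max_9 -> at most 7290 visual tokens
```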

Checklist

  1. Ensure `pre-commit run --all-files` or other linting tools are used to fix potential lint issues.
  2. Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  3. Modify documentation as needed, such as docstrings or example tutorials.

@Luodian
Contributor

Luodian commented Aug 16, 2024

Sorry for sending the PR a second time. We found it hard to rebase our previous implementation, which incorporated the SigLIP and multi-image features, onto sglang after v0.2: it would hang for no apparent reason after prefill and decode, and we were unable to find the cause after a few days of trying.

So we checked out the latest main and re-implemented the features. We should still acknowledge the contributions to the original sglang implementation: @kcz358 for adding the SigLIP feature (and for continuously debugging/maintaining this PR), Peiyuan Zhang for adding the multi-image feature, and @Luodian for debugging the token strategies for single-image/multi-image/video in llava-onevision.

@Luodian
Contributor

Luodian commented Aug 16, 2024

SGLang Performance (Image)

on AI2D, 10-20 input text tokens + 6-8k input visual tokens, 2-3 output tokens.

| Model | LLaVA-OneVision-0.5B | LLaVA-OneVision-7B | LLaVA-OneVision-72B |
| --- | --- | --- | --- |
| LMMs-Eval LLaVA_Onevision | ~5s/it | ~3s/it | ~3s/it |
| LMMs-Eval SRT-API (SGLang) | ~5s/it | ~5s/it | ~1.5s/it |

BS=1, TP=8

SGLang Performance (Video)

on VideoMME, 30-50 input text tokens + 32 * 196 input visual tokens, 2-3 output tokens.

| Model | LLaVA-OneVision-0.5B | LLaVA-OneVision-7B | LLaVA-OneVision-72B |
| --- | --- | --- | --- |
| LMMs-Eval LLaVA_Onevision | ~4s/it | ~5s/it | ~8.5s/it |
| LMMs-Eval SRT-API (SGLang) | ~3.5s/it | ~4s/it | ~6s/it |

BS=1, TP=8

Note: the number of attention heads in the 0.5B and 7B models is not divisible by 8, so they are served on a single GPU only.

@Luodian
Contributor

Luodian commented Aug 16, 2024

Results

AI2D

LLaVA-OV-0.5B (original): 57.1%
LLaVA-OV-0.5B (srt-api, sglang): 56.6%

VideoMME (32 frames)

LLaVA-OV-0.5B (original): 44.0%
LLaVA-OV-0.5B (srt-api, sglang): 43.5%

@kcz358
Contributor Author

kcz358 commented Aug 17, 2024

More results

MME

LLaVA-OV-0.5B (original): 240/1238
LLaVA-OV-0.5B (srt-api, sglang): 240/1232

Notice: Due to the anyres_max_9 strategy, we might need to set --chunked-prefill-size 16384 when setting up the server; otherwise the token count might exceed the default of 8192.
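
For example (illustrative only; this mirrors the launch command used later in this thread, with the flag from the note above added):

python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --tokenizer-path lmms-lab/llavanext-qwen-siglip-tokenizer --port=30000 --tp-size=8 --chat-template=chatml-llava --chunked-prefill-size 16384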

@Ying1123 Ying1123 mentioned this pull request Aug 17, 2024
@Luodian
Contributor

Luodian commented Aug 18, 2024

Test in More Output Tokens Scenario

I tested lmms-lab/llava-onevision-qwen2-72b-ov with:

python ./test/srt/test_llava_onevision_openai_server.py

Single Image

Prompt Tokens: 7186 (7150 visual + 36 text)
Completion Tokens: 487
Completion Tokens/Sec: 45.02

Video

Prompt Tokens: 6298 (6272 visual, i.e. 32 frames x 196 tokens per frame, + 26 text)
Completion Tokens: 224
Completion Tokens/Sec: 26.41

But during completion, I can see the log shows gen throughput (token/s): 50-51 in both the image and the video case.
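
For reference, a request in that OpenAI-compatible format might look like the following. This is a hypothetical sketch, not the contents of test_llava_onevision_openai_server.py; the base_url, model name, and image URL are assumptions, and only the interleaved image/text message format follows what this PR describes.

```python
import openai

# Point the standard OpenAI client at the locally launched sglang server.
client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # assumed model name; depends on how the server is launched
    messages=[
        {
            "role": "user",
            "content": [
                # Image and text parts can be interleaved in a single message.
                {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
    temperature=0,
    max_tokens=512,
)
print(response.choices[0].message.content)
```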

@Luodian
Contributor

Luodian commented Aug 20, 2024

SGLang Performance (Image)

on AI2D, 10-20 input text tokens + 6-8k input visual tokens, 2-3 output tokens.

| Model | LLaVA-OneVision-0.5B | LLaVA-OneVision-7B | LLaVA-OneVision-72B |
| --- | --- | --- | --- |
| LMMs-Eval LLaVA_Onevision | ~5s/it | ~3s/it | ~3s/it |
| LMMs-Eval SRT-API (SGLang) | ~5s/it | ~5s/it | ~1.5s/it |

BS=1, TP=8

SGLang Performance (Video)

on VideoMME, 30-50 input text tokens + 32 * 196 input visual tokens, 2-3 output tokens.

| Model | LLaVA-OneVision-0.5B | LLaVA-OneVision-7B | LLaVA-OneVision-72B |
| --- | --- | --- | --- |
| LMMs-Eval LLaVA_Onevision | ~4s/it | ~5s/it | ~8.5s/it |
| LMMs-Eval SRT-API (SGLang) | ~3.5s/it | ~4s/it | ~6s/it |

BS=1, TP=8

Note: the number of attention heads in the 0.5B and 7B models is not divisible by 8, so they are served on a single GPU only.

Note: I think this is not the best way to demonstrate the acceleration of sglang since both benchmarks only need the model to decode a few tokens. I will try llava_bench to better show the difference.

README.md Outdated
@@ -227,8 +227,13 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- `python -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
- `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
- `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 30000`
- `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host=127.0.0.1 --tp-size=1 --chat-template=llava_llama_3`
- `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=8`
Contributor

Is it possible to move the tokenizer files under the model repo? That way we would not need to specify --tokenizer-path separately.

Contributor

@kcz358 We solved this in llava-llama3-8b; can you try whether we can do the same for llava-onevision? I think we can ignore llava-next-72b.

Contributor Author

Hi, I have merged the tokenizers for all the models (llava-onevision, llava-next-72b, and llava-llama3-8b) and revised the commands. Here are the test results:

llava-llama3-8B

[screenshot]

llava-next-72B

[screenshot]

llava-onevision-72B

[screenshot]
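
Based on the merged tokenizers, the revised launch command presumably no longer needs --tokenizer-path, e.g. (illustrative sketch, not necessarily the exact README wording):

python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava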

@Luodian
Contributor

Luodian commented Aug 21, 2024

Here are the results in scenarios with more output tokens. We can see the speed-up now.

  1. videodc_499, ~6300 input tokens / ~150 output tokens

     | Model | Inference Speed |
     | --- | --- |
     | LMMs-Eval LLaVA_Onevision | ~46.7s/it |
     | LMMs-Eval SRT_API (SGLang) | ~18.3s/it |

  2. llava_in_the_wild, ~8000 input tokens / ~100 output tokens

     | Model | Inference Speed |
     | --- | --- |
     | LMMs-Eval LLaVA_Onevision | ~23.7s/it |
     | LMMs-Eval SRT_API (SGLang) | ~5.3s/it |

The evaluation is conducted with lmms-eval; we observe the average time per request (s/it).
The baseline is llava_onevision:

python3 -m lmms_eval --model=llava_onevision \
    --model_args=pretrained=lmms-lab/llava-onevision-qwen2-72b-ov,conv_template=qwen_1_5,device_map=auto,model_name=llava_qwen \
    --tasks=$TASKS \
    --batch_size=1 \
    --log_samples \
    --log_samples_suffix=${BASE_NAME} \
    --output_path="./logs/"

The SGLang backend is srt_api:

python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --tokenizer-path lmms-lab/llavanext-qwen-siglip-tokenizer --port=30000 --host=127.0.0.1 --tp-size=8 --chat-template=chatml-llava;

python3 -m lmms_eval \
    --model srt_api \
    --model_args modality=$MODALITY,host=127.0.0.1,port=30000,timeout=600,max_frames_num=$MAX_FRAMES_NUM \
    --tasks $TASK \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix $TASK_SUFFIX \
    --output_path ./logs/

kcz358 and others added 15 commits August 22, 2024 10:27
✨ feat(test_multi_image_openai_server.py): add functionality to download jobs.mp4
🔧 fix(test_multi_image_openai_server.py): update video path to use cached location
This commit removes the video file that is no longer needed in the project.
Cleaning up unused assets helps to reduce clutter and improve project maintainability.
🔧 fix(test_multi_image_openai_server.py): update video URL to match new location

This commit introduces a new feature in `srt_example_llava_v.py` that
downloads a video file from a specified URL and saves it to the user's
cache directory. This change enhances usability by ensuring that the
required video file is readily available for processing.

Additionally, the test file `test_multi_image_openai_server.py` is
updated to reflect the new URL for the video file, ensuring that tests
point to the correct resource location. This helps maintain the integrity
of the tests and ensures they run successfully with the updated video
source.
This change adds a try-except block around the code that retrieves the
index of `image_token_index` in `input_ids`. If the index is not found,
it defaults to 0 instead of raising an error. This improves the robustness
of the code by ensuring that it can handle cases where the image token
is not present in the input, preventing potential crashes during execution.
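
For illustration, the change described in the commit message above probably has roughly this shape (a minimal sketch; the variable names follow the commit message, not necessarily the actual code):

```python
try:
    offset = input_ids.index(image_token_index)
except ValueError:
    # The image token may be absent (e.g. a text-only request);
    # fall back to 0 instead of raising an error.
    offset = 0
```
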
This commit adds new examples for launching the server with
additional model paths, tokenizer paths, and parameters such as
host and tp-size. These updates aim to provide clearer guidance
for users on how to utilize the latest models and configurations,
enhancing the overall documentation and user experience.
This commit refactors the image processing logic in the TokenizerManager class. The changes include:
- Adding a new method `_process_single_image` to handle single image processing
- Updating the `get_pixel_values` method to use the `_process_single_image` method for single image data
- Simplifying the conditional statements for handling image data types

These changes aim to improve the readability and maintainability of the code related to image processing in the TokenizerManager class.
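
A minimal sketch of the refactor described in the last commit message above (the method names `_process_single_image` and `get_pixel_values` come from the commit message; everything else, including the signatures, is an assumption):

```python
from typing import List, Union

class TokenizerManager:
    def _process_single_image(self, image_data):
        # Decode one image (URL, path, or raw bytes) and run the image
        # processor to obtain its pixel values.
        ...

    def get_pixel_values(self, image_data: Union[str, bytes, List]):
        # A list means multi-image (or video-frame) input: process each entry.
        if isinstance(image_data, list):
            return [self._process_single_image(img) for img in image_data]
        # Otherwise treat it as a single image.
        return self._process_single_image(image_data)
```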
@kcz358 kcz358 force-pushed the dev/onevision_main branch from 777da44 to 021e0d8 on August 22, 2024 10:35
@Luodian
Contributor

Luodian commented Aug 23, 2024

@merrymercy Can we get this PR merged now?

@merrymercy merrymercy merged commit a5b14ad into sgl-project:main Aug 23, 2024
5 checks passed
@merrymercy
Contributor

@Luodian @kcz358 Great work! It is merged now

@Ying1123 Ying1123 changed the title [Feat/WIP] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. Aug 24, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
…(2) qwen2 decoder (3) openai api compatible server. (sgl-project#1123)

Co-authored-by: Bo Li <drluodian@gmail.com>