[Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. #1123
Conversation
Sorry for sending the PR a second time. We found it hard to rebase our previous implementation for incorporating the SigLIP and multi-image features into sglang, so we checked out from the latest main and re-implemented the features. We should still acknowledge the contributions of the original sglang implementation, with the help of @kcz358 for adding …
SGLang Performance (Image): BS=1, TP=8

SGLang Performance (Video): BS=1, TP=8

Note: for the 0.5B and 7B models, the number of attention heads can't be divided by 8, so they are served only on a single GPU.
Results

AI2D
LLaVA-OV-0.5B (original): 57.1%

VideoMME (32 frames)
LLaVA-OV-0.5B (original): 44.0%
More results

MME
LLaVA-OV-0.5B (original): 240/1238

Notice: due to the anyres_max_9 strategy, we might need to set …
Test in More Output Tokens Scenario

I tested the …

Single Image
Prompt Tokens: 7186 (7150 visual + 36 text)

Video
Prompt Tokens: 6298 (6272 visual + 26 text)

But during the completion, I can see the log shows the …

Note: I think this is not the best way to demonstrate the acceleration of sglang, since both benchmarks only need the model to decode a few tokens. I will try …
README.md (Outdated)
@@ -227,8 +227,13 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- `python -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
- `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
- `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 30000`
- `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --tokenizer-path lmms-lab/llama3-llava-next-8b-tokenizer --port=30000 --host=127.0.0.1 --tp-size=1 --chat-template=llava_llama_3`
- `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30000 --host="127.0.0.1" --tp-size=8`
Is it possible to move the tokenizer files under the model repo, so we do not need to specify `--tokenizer-path` again?
@kcz358 We solved this in `llava-llama3-8b`; can you keep trying to see if we can do the same for llava-onevision? I think we can ignore `llava-next-72b`.
Here are the results in the more-output-tokens scenario. We can see the speedup now.
The evaluation is conducted with:

```bash
python3 -m lmms_eval --model=llava_onevision \
    --model_args=pretrained=lmms-lab/llava-onevision-qwen2-72b-ov,conv_template=qwen_1_5,device_map=auto,model_name=llava_qwen \
    --tasks=$TASKS \
    --batch_size=1 \
    --log_samples \
    --log_samples_suffix=${BASE_NAME} \
    --output_path="./logs/"
```

The SGLang setup is:

```bash
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --tokenizer-path lmms-lab/llavanext-qwen-siglip-tokenizer --port=30000 --host=127.0.0.1 --tp-size=8 --chat-template=chatml-llava;
python3 -m lmms_eval \
    --model srt_api \
    --model_args modality=$MODALITY,host=127.0.0.1,port=30000,timeout=600,max_frames_num=$MAX_FRAMES_NUM \
    --tasks $TASK \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix $TASK_SUFFIX \
    --output_path ./logs/
```
✨ feat(test_multi_image_openai_server.py): add functionality to download jobs.mp4
🔧 fix(test_multi_image_openai_server.py): update video path to use cached location
This commit removes the video file that is no longer needed in the project. Cleaning up unused assets helps to reduce clutter and improve project maintainability.
🔧 fix(test_multi_image_openai_server.py): update video URL to match new location

This commit introduces a new feature in `srt_example_llava_v.py` that downloads a video file from a specified URL and saves it to the user's cache directory. This change enhances usability by ensuring that the required video file is readily available for processing. Additionally, the test file `test_multi_image_openai_server.py` is updated to reflect the new URL for the video file, ensuring that tests point to the correct resource location. This helps maintain the integrity of the tests and ensures they run successfully with the updated video source.
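A minimal sketch of what such download-to-cache logic could look like; the function name, cache path, and URL below are illustrative assumptions, not the exact code in this PR:

```python
import os
import urllib.request

def download_video_to_cache(url: str, filename: str) -> str:
    """Download a video once and return its path in the user's cache directory."""
    cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "sglang_examples")
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, filename)
    if not os.path.exists(local_path):
        # Only download if the file is not already cached.
        urllib.request.urlretrieve(url, local_path)
    return local_path

# Example usage (placeholder URL, not the one used in the PR):
# video_path = download_video_to_cache("https://example.com/jobs.mp4", "jobs.mp4")
```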
This change adds a try-except block around the code that retrieves the index of `image_token_index` in `input_ids`. If the index is not found, it defaults to 0 instead of raising an error. This improves the robustness of the code by ensuring that it can handle cases where the image token is not present in the input, preventing potential crashes during execution.
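A minimal sketch of the described fallback, assuming the prompt token ids are held in a Python list; the token id constant and function name are illustrative, not the exact identifiers in this PR:

```python
# IMAGE_TOKEN_ID stands in for the model's image placeholder token id (assumed value).
IMAGE_TOKEN_ID = 151646

def find_image_token_offset(input_ids: list) -> int:
    try:
        # Position of the image placeholder token within the prompt.
        return input_ids.index(IMAGE_TOKEN_ID)
    except ValueError:
        # No image token in the input: fall back to 0 instead of raising an error.
        return 0

print(find_image_token_offset([1, 2, IMAGE_TOKEN_ID, 4]))  # -> 2
print(find_image_token_offset([1, 2, 3]))                  # -> 0
```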
This commit adds new examples for launching the server with additional model paths, tokenizer paths, and parameters such as host and tp-size. These updates aim to provide clearer guidance for users on how to utilize the latest models and configurations, enhancing the overall documentation and user experience.
This commit refactors the image processing logic in the TokenizerManager class. The changes include:
- Adding a new method `_process_single_image` to handle single image processing
- Updating the `get_pixel_values` method to use the `_process_single_image` method for single image data
- Simplifying the conditional statements for handling image data types

These changes aim to improve the readability and maintainability of the code related to image processing in the TokenizerManager class.
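A schematic illustration of the refactor's shape, not the actual sglang code; the class is a stand-in and the method bodies are placeholders kept trivial so the sketch runs on its own:

```python
from typing import Any, List, Tuple, Union

class TokenizerManagerSketch:
    """Illustrative stand-in for the real TokenizerManager."""

    def _process_single_image(self, image_data: Any) -> Tuple[Any, Any]:
        # In the real code this would load the image and run the image processor;
        # here we simply echo the input to keep the sketch self-contained.
        pixel_values = image_data
        image_size = None
        return pixel_values, image_size

    def get_pixel_values(self, image_data: Union[Any, List[Any]]):
        # A list (multi-image or video frames) is processed item by item;
        # a single item goes through the same helper once.
        if isinstance(image_data, list):
            return [self._process_single_image(item) for item in image_data]
        return self._process_single_image(image_data)
```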
777da44 to 021e0d8
@merrymercy Can we get this PR merged now?
…(2) qwen2 decoder (3) openai api compatible server. (sgl-project#1123) Co-authored-by: Bo Li <drluodian@gmail.com>
This is a working PR; I'm putting it here to show the changes and welcome any suggestions.
Motivation
With this PR, we wish to add support for LLaVA-OneVision, a new model that accepts multiple input types: single image (with anyres_max_N), multi-image, and video.
Modification
Add an OpenAI-compatible server for launching a backend service for the llava-onevision model. For the multi-type inputs mentioned above, we use OpenAI's GPT message format as the input interface (image and text can also be interleaved); a request sketch is shown below.
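For instance, a request against a server launched as in the commands above could look like the following; the image URLs are placeholders and the payload simply follows the standard OpenAI chat-completions multimodal format, so treat it as a sketch rather than the PR's own test code:

```python
import openai

# Point the standard OpenAI client at the locally launched sglang server.
client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lmms-lab/llava-onevision-qwen2-72b-ov",
    messages=[
        {
            "role": "user",
            # Text and images can be interleaved in the content list.
            "content": [
                {"type": "text", "text": "Compare these two images."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}},
            ],
        }
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
```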
Checklist