feat: generalize VLM embedding extraction #1388
Conversation
Actionable comments posted: 0
♻️ Duplicate comments (1)
examples/multimodal/README.md (1)
64-70: Generalize serve commands for multiple VLMs
The section now lists commands for LLaVA, Qwen2.5-VL, and Phi3V, addressing prior feedback on removing LLaVA-specific references. Please ensure config path notation is consistent across sections (e.g., using or omitting ./ uniformly).
🧹 Nitpick comments (2)
examples/multimodal/README.md (2)
103-104: DRY the model-replacement note
The guidance to replace the "model" field for Qwen and Phi3V appears here; consider extracting it into a single reusable note or referencing the earlier section to avoid duplication.
263-264: Reword repeated note and avoid redundancy
The sentence starting with "If serving the example Qwen model…" is repeated here. Consider rephrasing or referencing the primary note to improve readability and avoid repeated sentence openings.
🧰 Tools
🪛 LanguageTool
[style] ~263-~263: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/multimodal/README.md (5 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/multimodal/README.md
[style] ~263-~263: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
🔇 Additional comments (3)
examples/multimodal/README.md (3)
98-98: Add temperature parameter to client payload
Including "temperature": 0.0 aligns with the new multimodal request protocol and provides users control over inference sampling.
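For reference, a request carrying the new field could look roughly like this; the endpoint URL, model IDs, and payload layout below are assumptions for illustration rather than the README's exact example.

```python
# Hypothetical client call illustrating the new "temperature" field discussed above.
# Endpoint URL, model IDs, and field layout are assumptions, not copied from the README.
import requests

payload = {
    # Swap this for "Qwen/Qwen2.5-VL-7B-Instruct" or the Phi3V checkpoint,
    # as the model-replacement note describes.
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "http://example.com/cat.jpg"}},
            ],
        }
    ],
    "max_tokens": 300,
    "temperature": 0.0,  # deterministic sampling, matching the documented example
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
print(resp.json())
```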
216-221: Generalize deploy commands for multiple VLMs
The deployment section now includes LLaVA, Qwen2.5-VL, and Phi3V configurations, matching the PR objective to support multiple vision-language models.
258-258: Add temperature parameter to deployment test payload
Consistency with the client example is maintained by adding "temperature": 0.0 to the deployment testing JSON.
We might want to add a note to the README that disagg is not yet supported.
Otherwise, LGTM.
Nice work!
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/multimodal/README.md (1)
181-181: Fix model naming consistency in disaggregated note
The note references “Qwen VL” and “PhiV”, which should match “Qwen2.5-VL” and “Phi3V” used elsewhere for consistency.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/multimodal/README.md (6 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/multimodal/README.md
[style] ~265-~265: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
🔇 Additional comments (7)
examples/multimodal/README.md (7)
64-69: Serve commands updated for multi-model support
Aggregation examples now include LLaVA, Qwen2.5-VL, and Phi3V configurations, improving flexibility and demonstrating generalized VLM support.
98-98: Added temperature parameter to aggregated client request
The inclusion of "temperature": 0.0 enables finer control over response randomness.
103-104: Model replacement instructions updated for aggregated examples
Clear guidance for swapping in Qwen2.5-VL or Phi3V models enhances usability.
171-171: Added temperature parameter to disaggregated client request
Consistent with aggregated examples, the temperature field was added to control inference behavior.
218-223: Deployment commands extended for multi-model configurations
Aggregation deployment now supports LLaVA, Qwen2.5-VL, and Phi3V via distinct YAML files, mirroring serve examples.
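As a rough illustration of what a per-model config carrying the newly configurable fields might look like, and how a worker could read it, here is a sketch; the key names, sample values, and loading code are assumptions, not the actual dynamo configuration schema.

```python
# Hypothetical sketch of reading per-model settings such as image-token-id and
# num-patches from a YAML config, which the PR makes configurable.
# Key names, values, and structure are assumptions for illustration only.
import yaml

SAMPLE_CONFIG = """
model: llava-hf/llava-1.5-7b-hf
image-token-id: 32000   # placeholder; check the model's tokenizer for the real value
num-patches: -1         # placeholder; -1 assumed to mean "derive from the model"
"""

def load_worker_config(text: str) -> dict:
    cfg = yaml.safe_load(text)
    return {
        "model": cfg["model"],
        "image_token_id": cfg.get("image-token-id"),
        "num_patches": cfg.get("num-patches"),
    }

print(load_worker_config(SAMPLE_CONFIG))
```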
260-260: Added temperature parameter to deployment test request
Including "temperature": 0.0 in the test aligns with updated example payloads and demonstrates end-to-end usage.
265-266: Model replacement instructions updated for deployment test
The guidance to swap in Qwen2.5-VL or Phi3V models is clear and matches the aggregated client examples.
🧰 Tools
🪛 LanguageTool
[style] ~265-~265: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Overview:
Details:
- Replaced LlavaForConditionalGeneration with helper methods like load_vision_model and get_multimodal_embeddings for better flexibility and maintainability (see the sketch below)
- embeddings_shape is no longer hardcoded
- image-token-id and num-patches are now configurable in the config YAML
- Added temperature support for inference calls
Demo:
2025-06-09-dynamo-multimodal-generalized-embedding-llava-qwen.mp4
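To make the refactor concrete, here is a minimal sketch of model-agnostic helpers along these lines; load_vision_model and get_multimodal_embeddings are named in the PR description, but the class registry, signatures, and embedding path below are assumptions rather than the actual implementation.

```python
# Minimal sketch of model-agnostic VLM loading and embedding extraction.
# load_vision_model / get_multimodal_embeddings come from the PR description;
# everything else (registry, signatures, fallbacks) is assumed for illustration.
import torch
from transformers import AutoConfig, LlavaForConditionalGeneration

# Assumed registry mapping HF architecture names to model classes; extend with
# Qwen2.5-VL, Phi3V, etc. as needed.
_VLM_CLASSES = {
    "LlavaForConditionalGeneration": LlavaForConditionalGeneration,
}

def load_vision_model(model_id: str, dtype=torch.float16):
    """Select the model class from the checkpoint config instead of hardcoding
    LlavaForConditionalGeneration."""
    config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    arch = config.architectures[0]
    cls = _VLM_CLASSES.get(arch)
    if cls is None:
        raise ValueError(f"Unsupported VLM architecture: {arch}")
    return cls.from_pretrained(model_id, torch_dtype=dtype)

def get_multimodal_embeddings(model, pixel_values: torch.Tensor) -> torch.Tensor:
    """Extract image embeddings through the model's own vision tower and projector.
    LLaVA-style models expose get_image_features (recent transformers releases);
    other architectures may need a per-model path (assumption)."""
    if hasattr(model, "get_image_features"):
        return model.get_image_features(
            pixel_values=pixel_values,
            vision_feature_layer=model.config.vision_feature_layer,
            vision_feature_select_strategy=model.config.vision_feature_select_strategy,
        )
    raise NotImplementedError("Add a model-specific embedding path here.")
```

A registry of this sort keeps per-model specifics in one place, so supporting Qwen2.5-VL or Phi3V becomes a registry entry plus whatever model-specific embedding handling is required.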
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
- New Features
- Improvements
- Documentation
- Chores