feat: generalize VLM embedding extraction #1388
Conversation
Actionable comments posted: 0
♻️ Duplicate comments (1)
examples/multimodal/README.md (1)
64-70: Generalize serve commands for multiple VLMs
The section now lists commands for LLaVA, Qwen2.5-VL, and Phi3V, addressing prior feedback on removing LLaVA-specific references. Please ensure config path notation is consistent across sections (e.g., using or omitting ./ uniformly).
🧹 Nitpick comments (2)
examples/multimodal/README.md (2)
103-104: DRY the model-replacement note
The guidance to replace the "model" field for Qwen and Phi3V appears here; consider extracting it into a single reusable note or referencing the earlier section to avoid duplication.
263-264: Reword repeated note and avoid redundancy
The sentence starting with "If serving the example Qwen model…" is repeated here. Consider rephrasing or referencing the primary note to improve readability and avoid repeated sentence openings.
🧰 Tools
🪛 LanguageTool
[style] ~263-~263: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/multimodal/README.md (5 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/multimodal/README.md
[style] ~263-~263: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
🔇 Additional comments (3)
examples/multimodal/README.md (3)
98-98: Add temperature parameter to client payload
Including "temperature": 0.0 aligns with the new multimodal request protocol and provides users control over inference sampling.
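For reference, a request carrying the new field could look roughly like this; the endpoint URL, model IDs, and payload layout below are assumptions for illustration rather than the README's exact example.

```python
# Hypothetical client call illustrating the new "temperature" field discussed above.
# Endpoint URL, model IDs, and field layout are assumptions, not copied from the README.
import requests

payload = {
    # Swap this for "Qwen/Qwen2.5-VL-7B-Instruct" or the Phi3V checkpoint,
    # as the model-replacement note describes.
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "http://example.com/cat.jpg"}},
            ],
        }
    ],
    "max_tokens": 300,
    "temperature": 0.0,  # deterministic sampling, matching the documented example
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
print(resp.json())
```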
216-221: Generalize deploy commands for multiple VLMs
The deployment section now includes LLaVA, Qwen2.5-VL, and Phi3V configurations, matching the PR objective to support multiple vision-language models.
258-258: Add temperature parameter to deployment test payload
Consistency with the client example is maintained by adding "temperature": 0.0 to the deployment testing JSON.
We might want to add a note to the README that disagg is not yet supported.
Otherwise, LGTM.
Nice work!
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/multimodal/README.md (1)
181-181: Fix model naming consistency in disaggregated note
The note references “Qwen VL” and “PhiV”, which should match “Qwen2.5-VL” and “Phi3V” used elsewhere for consistency.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/multimodal/README.md (6 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/multimodal/README.md
[style] ~265-~265: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - vllm
🔇 Additional comments (7)
examples/multimodal/README.md (7)
64-69: Serve commands updated for multi-model support
Aggregation examples now include LLaVA, Qwen2.5-VL, and Phi3V configurations, improving flexibility and demonstrating generalized VLM support.
98-98: Added temperature parameter to aggregated client request
The inclusion of "temperature": 0.0 enables finer control over response randomness.
103-104: Model replacement instructions updated for aggregated examples
Clear guidance for swapping in Qwen2.5-VL or Phi3V models enhances usability.
171-171: Added temperature parameter to disaggregated client request
Consistent with aggregated examples, the temperature field was added to control inference behavior.
218-223: Deployment commands extended for multi-model configurations
Aggregation deployment now supports LLaVA, Qwen2.5-VL, and Phi3V via distinct YAML files, mirroring serve examples.
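As a rough illustration of what a per-model config carrying the newly configurable fields might look like, and how a worker could read it, here is a sketch; the key names, sample values, and loading code are assumptions, not the actual dynamo configuration schema.

```python
# Hypothetical sketch of reading per-model settings such as image-token-id and
# num-patches from a YAML config, which the PR makes configurable.
# Key names, values, and structure are assumptions for illustration only.
import yaml

SAMPLE_CONFIG = """
model: llava-hf/llava-1.5-7b-hf
image-token-id: 32000   # placeholder; check the model's tokenizer for the real value
num-patches: -1         # placeholder; -1 assumed to mean "derive from the model"
"""

def load_worker_config(text: str) -> dict:
    cfg = yaml.safe_load(text)
    return {
        "model": cfg["model"],
        "image_token_id": cfg.get("image-token-id"),
        "num_patches": cfg.get("num-patches"),
    }

print(load_worker_config(SAMPLE_CONFIG))
```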
260-260: Added temperature parameter to deployment test request
Including "temperature": 0.0 in the test aligns with updated example payloads and demonstrates end-to-end usage.
265-266: Model replacement instructions updated for deployment test
The guidance to swap in Qwen2.5-VL or Phi3V models is clear and matches the aggregated client examples.
🧰 Tools
🪛 LanguageTool
[style] ~265-~265: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac... (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Overview:
Details:
- Replaced LlavaForConditionalGeneration with helper methods like load_vision_model and get_multimodal_embeddings for better flexibility and maintainability (see the sketch below)
- embeddings_shape is no longer hardcoded
- image-token-id and num-patches are now configurable in the config YAML
- Added temperature support for inference calls
Demo:
2025-06-09-dynamo-multimodal-generalized-embedding-llava-qwen.mp4
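To make the refactor concrete, here is a minimal sketch of model-agnostic helpers along these lines; load_vision_model and get_multimodal_embeddings are named in the PR description, but the class registry, signatures, and embedding path below are assumptions rather than the actual implementation.

```python
# Minimal sketch of model-agnostic VLM loading and embedding extraction.
# load_vision_model / get_multimodal_embeddings come from the PR description;
# everything else (registry, signatures, fallbacks) is assumed for illustration.
import torch
from transformers import AutoConfig, LlavaForConditionalGeneration

# Assumed registry mapping HF architecture names to model classes; extend with
# Qwen2.5-VL, Phi3V, etc. as needed.
_VLM_CLASSES = {
    "LlavaForConditionalGeneration": LlavaForConditionalGeneration,
}

def load_vision_model(model_id: str, dtype=torch.float16):
    """Select the model class from the checkpoint config instead of hardcoding
    LlavaForConditionalGeneration."""
    config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    arch = config.architectures[0]
    cls = _VLM_CLASSES.get(arch)
    if cls is None:
        raise ValueError(f"Unsupported VLM architecture: {arch}")
    return cls.from_pretrained(model_id, torch_dtype=dtype)

def get_multimodal_embeddings(model, pixel_values: torch.Tensor) -> torch.Tensor:
    """Extract image embeddings through the model's own vision tower and projector.
    LLaVA-style models expose get_image_features (recent transformers releases);
    other architectures may need a per-model path (assumption)."""
    if hasattr(model, "get_image_features"):
        return model.get_image_features(
            pixel_values=pixel_values,
            vision_feature_layer=model.config.vision_feature_layer,
            vision_feature_select_strategy=model.config.vision_feature_select_strategy,
        )
    raise NotImplementedError("Add a model-specific embedding path here.")
```

A registry of this sort keeps per-model specifics in one place, so supporting Qwen2.5-VL or Phi3V becomes a registry entry plus whatever model-specific embedding handling is required.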
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
- New Features
- Improvements
- Documentation
- Chores