Conversation

@hhzhang16 (Contributor) commented on Jun 4, 2025

Overview:

  • Replaced LLaVA-specific code with modular, model-agnostic functions and variables
  • Replaced some hardcoded values with configurable parameters
  • Refined handling of embeddings and prompts
  • Added aggregated (agg) serving support for Phi3V, Qwen2-VL, and Qwen2.5-VL

Details:

  • Replaced LlavaForConditionalGeneration with helper methods such as load_vision_model and get_multimodal_embeddings for better flexibility and maintainability (a sketch follows this list)
  • Determined embeddings_shape dynamically instead of hardcoding it
  • Added agg config YAML files for Qwen2.5-VL and Phi3V; Qwen2-VL also works if you replace the model name in agg-qwen.yaml
  • Made image-token-id and num-patches configurable in the config YAML
  • Added prompt template validation
  • Added temperature support for inference calls
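
For context, here is a minimal, hypothetical sketch of what the model-agnostic helpers could look like. The names load_vision_model and get_multimodal_embeddings are taken from the description above; the signatures, the use of Hugging Face Auto classes, and the shape derivation are illustrative assumptions, not the actual implementation in this PR.

```python
# Hypothetical sketch only; the helpers in this PR may be structured differently.
from typing import Tuple

import torch
from transformers import AutoConfig, AutoModel


def load_vision_model(model_id: str) -> torch.nn.Module:
    # Load any vision-language model generically instead of hardcoding
    # LlavaForConditionalGeneration.
    return AutoModel.from_pretrained(model_id, trust_remote_code=True)


def get_embeddings_shape(model_id: str, num_patches: int) -> Tuple[int, int]:
    # Derive the per-image embedding shape from the model config rather than
    # hardcoding it. num_patches would come from the new num-patches config key.
    config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    hidden_size = getattr(config, "hidden_size", None) or config.text_config.hidden_size
    return (num_patches, hidden_size)


# Example: LLaVA-1.5 uses 576 image tokens (24 x 24 patches), so the shape
# for that model would be (576, hidden_size).
shape = get_embeddings_shape("llava-hf/llava-1.5-7b-hf", num_patches=576)
```

Deriving the embedding shape from the model config, rather than from a constant, is what lets the same worker code serve LLaVA, Qwen2-VL/2.5-VL, and Phi3V without per-model branches.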

Demo:

2025-06-09-dynamo-multimodal-generalized-embedding-llava-qwen.mp4

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added support for specifying image token ID and number of patches in multimodal configurations and CLI arguments.
    • Introduced new configuration files for Qwen2.5-VL-7B-Instruct and Phi-3.5-vision-instruct multimodal model setups.
    • Added temperature control and image dimension metadata to multimodal requests and responses.
    • Included new utility functions for vision model loading, embedding info retrieval, and multimodal data construction.
  • Improvements

    • Enhanced validation and error handling in multimodal request processing.
    • Refactored workers to use dynamic shapes and configurable parameters for image embeddings.
    • Improved logging and debugging information in multimodal components.
    • Updated example commands and prompts to support multiple multimodal models with clear instructions.
  • Documentation

    • Updated README instructions and example requests to reflect new configuration usage and parameters.
  • Chores

    • Added the pynvml package to dependencies.

@coderabbitai (bot) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
examples/multimodal/README.md (1)

64-70: Generalize serve commands for multiple VLMs
The section now lists commands for LLaVA, Qwen2.5-VL, and Phi3V, addressing prior feedback on removing LLaVA-specific references. Please ensure config path notation is consistent across sections (e.g., using or omitting ./ uniformly).

🧹 Nitpick comments (2)
examples/multimodal/README.md (2)

103-104: DRY the model-replacement note
The guidance to replace the "model" field for Qwen and Phi3V appears here; consider extracting it into a single reusable note or referencing the earlier section to avoid duplication.


263-264: Reword repeated note and avoid redundancy
The sentence starting with "If serving the example Qwen model…" is repeated here. Consider rephrasing or referencing the primary note to improve readability and avoid repeated sentence openings.

🧰 Tools
🪛 LanguageTool

[style] ~263-~263: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d586343 and 1b0efc0.

📒 Files selected for processing (1)
  • examples/multimodal/README.md (5 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/multimodal/README.md

[style] ~263-~263: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (3)
examples/multimodal/README.md (3)

98-98: Add temperature parameter to client payload
Including "temperature": 0.0 aligns with the new multimodal request protocol and provides users control over inference sampling.


216-221: Generalize deploy commands for multiple VLMs
The deployment section now includes LLaVA, Qwen2.5-VL, and Phi3V configurations, matching the PR objective to support multiple vision-language models.


258-258: Add temperature parameter to deployment test payload
Consistency with the client example is maintained by adding "temperature": 0.0 to the deployment testing JSON.


@whoisj (Collaborator) left a comment

We might want to add a note to the README that disagg is not yet supported.

Otherwise, LGTM.


@krishung5 (Contributor) left a comment

Nice work!

… github.com:ai-dynamo/dynamo into hannahz/dep-114-generalize-vlm-embedding-extraction

@coderabbitai (bot) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/multimodal/README.md (1)

181-181: Fix model naming consistency in disaggregated note
The note references “Qwen VL” and “PhiV”, which should match “Qwen2.5-VL” and “Phi3V” used elsewhere for consistency.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b0efc0 and d12e86d.

📒 Files selected for processing (1)
  • examples/multimodal/README.md (6 hunks)
🧰 Additional context used
🪛 LanguageTool
examples/multimodal/README.md

[style] ~265-~265: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (7)
examples/multimodal/README.md (7)

64-69: Serve commands updated for multi-model support
Aggregation examples now include LLaVA, Qwen2.5-VL, and Phi3V configurations, improving flexibility and demonstrating generalized VLM support.


98-98: Added temperature parameter to aggregated client request
The inclusion of "temperature": 0.0 enables finer control over response randomness.


103-104: Model replacement instructions updated for aggregated examples
Clear guidance for swapping in Qwen2.5-VL or Phi3V models enhances usability.


171-171: Added temperature parameter to disaggregated client request
Consistent with aggregated examples, the temperature field was added to control inference behavior.


218-223: Deployment commands extended for multi-model configurations
Aggregation deployment now supports LLaVA, Qwen2.5-VL, and Phi3V via distinct YAML files, mirroring serve examples.


260-260: Added temperature parameter to deployment test request
Including "temperature": 0.0 in the test aligns with updated example payloads and demonstrates end-to-end usage.


265-266: Model replacement instructions updated for deployment test
The guidance to swap in Qwen2.5-VL or Phi3V models is clear and matches the aggregated client examples.

🧰 Tools
🪛 LanguageTool

[style] ~265-~265: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...d with "Qwen/Qwen2.5-VL-7B-Instruct". If serving the example Phi3V model, replac...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

@hhzhang16 merged commit e924a7c into main on Jun 11, 2025
10 checks passed
@hhzhang16 deleted the hannahz/dep-114-generalize-vlm-embedding-extraction branch on Jun 11, 2025 at 21:42
@hhzhang16 self-assigned this on Jun 13, 2025