feat: set model specific prompt templates in the multimodal config files, add documentation for multimodal example deployment #1366
Conversation
…-set-model-specific-prompt-templates-in-the-config-file
## Walkthrough
This update introduces prompt template configurability to multimodal and LLM examples, allowing prompt formatting via YAML configuration or CLI. Documentation is expanded for Kubernetes deployment with Dynamo Operator, including ingress testing. Minor code changes ensure prompt templates are applied during request processing, and GPU resource specifications are standardized in configuration files.
## Changes
| File(s) | Change Summary |
|------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| examples/llm/README.md | Clarified deployment testing steps, adding ingress-based direct API call instructions. |
| examples/llm/configs/disagg.yaml | Added `prompt-template` entry to `Processor` config. |
| examples/multimodal/README.md | Renamed sections, added detailed Kubernetes deployment instructions, and expanded testing guidance. |
| examples/multimodal/components/decode_worker.py | Set model loading to use `device_map="auto"`, `torch_dtype=bfloat16`, and `.eval()` for remote prefill. |
| examples/multimodal/components/prefill_worker.py | Simplified and clarified comments regarding dummy token insertion; added a TODO for flexibility. |
| examples/multimodal/components/processor.py | Modified prompt construction to use a configurable template from `engine_args.prompt_template`. |
| examples/multimodal/configs/agg.yaml | Added `prompt-template` to `Processor`; changed GPU resource values from int to string. |
| examples/multimodal/configs/disagg.yaml | Added `prompt-template` to `Processor`; changed GPU resource values from int to string. |
| examples/multimodal/utils/vllm.py | Added `--prompt-template` CLI argument and corresponding attribute in engine args. |
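Based on the change summaries above, the new `Processor` entry might look roughly like the following sketch. This is illustrative only: the template string and surrounding keys are assumptions, not copied from the PR.

```yaml
# Hypothetical sketch of a Processor section after this PR;
# the template string is an assumed example, not the actual default.
Processor:
  prompt-template: "USER: <image>\n<prompt> ASSISTANT:"
  ServiceArgs:
    resources:
      gpu: '1'   # quoted string, per the standardization noted above
```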
## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant User
participant Processor
participant Model
User->>Processor: Send request with messages and image
Processor->>Processor: Apply prompt template (replace <prompt>)
Processor->>Model: Forward formatted prompt and image
Model-->>Processor: Generate response
    Processor-->>User: Return response
```
Actionable comments posted: 2
🧹 Nitpick comments (3)
examples/multimodal/components/prefill_worker.py (1)
Lines 249-250: Address the line length and consider implementing the TODO.

The comment simplification is good, but there are a couple of issues to address:
- Line 249 exceeds the 100-character limit (102 characters)
- The TODO suggests making dummy token handling more flexible and model-dependent

Apply this diff to fix the line length:

```diff
-    # some placeholder dummy tokens are inserted based on the embedding size in the worker.py.
+    # some placeholder dummy tokens are inserted based on the embedding size in worker.py.
```

Would you like me to help implement a more flexible, model-dependent approach for the dummy token handling?
🧰 Tools
🪛 Pylint (3.3.7)
[convention] 249-249: Line too long (102/100)
(C0301)
[warning] 250-250: TODO: make this more flexible/model-dependent
(W0511)
examples/multimodal/README.md (2)
Lines 100-102: Inconsistent section heading terminology

The disaggregated example now uses "### Local Serving" while the aggregated example retains "### Deployment". This mismatch could confuse readers. Consider renaming for parity, e.g.:

```diff
-### Local Serving
+### Local Deployment
```

or

```diff
-### Local Serving
+### Local Serving (Disaggregated)
```
Lines 162-239: New Operator deployment section looks solid; minor path correction

The "Deployment with Dynamo Operator" segment is comprehensive and aligns well with the PR objectives. One small nitpick: the commented disaggregated deploy command omits `./` before `configs/disagg.yaml`, unlike the aggregated example. For consistency, update as follows:

```diff
-# dynamo deploy $DYNAMO_TAG -n $DEPLOYMENT_NAME -f configs/disagg.yaml
+# dynamo deploy $DYNAMO_TAG -n $DEPLOYMENT_NAME -f ./configs/disagg.yaml
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- `examples/llm/README.md` (1 hunks)
- `examples/llm/configs/disagg.yaml` (1 hunks)
- `examples/multimodal/README.md` (2 hunks)
- `examples/multimodal/components/decode_worker.py` (1 hunks)
- `examples/multimodal/components/prefill_worker.py` (1 hunks)
- `examples/multimodal/components/processor.py` (1 hunks)
- `examples/multimodal/configs/agg.yaml` (3 hunks)
- `examples/multimodal/configs/disagg.yaml` (3 hunks)
- `examples/multimodal/utils/vllm.py` (2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
examples/multimodal/components/prefill_worker.py
[convention] 249-249: Line too long (102/100)
(C0301)
[warning] 250-250: TODO: make this more flexible/model-dependent
(W0511)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (8)
examples/multimodal/utils/vllm.py (2)
Lines 54-59: LGTM! Clean implementation of prompt template CLI argument.

The argument follows the established pattern and provides a sensible default value.

Line 68: LGTM! Proper assignment of parsed argument.

The prompt template is correctly assigned to the engine args object, maintaining consistency with other custom arguments.
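As a rough illustration of the pattern being reviewed, a `--prompt-template` flag can be added with `argparse` as sketched below. The default value and variable names here are assumptions for the example, not the PR's actual code.

```python
import argparse

# Hypothetical sketch of a --prompt-template CLI flag; the default
# template string is an assumed example, not the PR's actual default.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--prompt-template",
    type=str,
    default="USER: <image>\n<prompt> ASSISTANT:",
    help="Prompt template; <prompt> is replaced with the user's text at request time.",
)

args = parser.parse_args([])
# The parsed value would then be assigned onto the engine args object.
engine_prompt_template = args.prompt_template
```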
examples/multimodal/configs/disagg.yaml (2)
Line 23: LGTM! Appropriate prompt template for multimodal applications.

The template format correctly includes both image and text placeholders, suitable for multimodal processing.
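To make the placeholder mechanism concrete, applying such a template can be sketched as a simple substitution. The function name and template string below are hypothetical, not taken from `processor.py`.

```python
def apply_prompt_template(template: str, user_text: str) -> str:
    # Substitute the user's text into the configured template; the <image>
    # placeholder is left in place for the engine to fill with image tokens.
    return template.replace("<prompt>", user_text)

template = "USER: <image>\n<prompt> ASSISTANT:"
formatted = apply_prompt_template(template, "Describe this image.")
# formatted == "USER: <image>\nDescribe this image. ASSISTANT:"
```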
Line 34: ❓ Verification inconclusive

Verify GPU resource specification type change.

The GPU resource specifications were changed from integer `1` to string `'1'`. Please ensure this change is intentional and that the configuration parser correctly handles string values for GPU resources.

Also applies to: 42-42, 51-51
🏁 Script executed:

```shell
#!/bin/bash
# Check if other config files use string or integer values for GPU resources
echo "Checking GPU resource specifications across config files:"
rg -A 2 -B 2 "gpu:" examples/
```

Length of output: 13853
Verify parser handles mixed GPU spec types and standardize configs

A search across the `examples/` directory shows that GPU counts are inconsistently quoted: many newer configs use string values (`gpu: '1'`, `'8'`, `'16'`), while some still use unquoted integers (`gpu: 1`, `gpu: 8`, etc.). Please ensure your YAML-loading logic correctly casts string values to integers for GPU resources, and consider normalizing all example configs to a single format.

Files still using integer GPU specs:
- examples/sglang/configs/disagg.yaml (`gpu: 1`)
- examples/tensorrt_llm/configs/agg.yaml (`gpu: 1`)
- examples/vllm_v0/configs/disagg.yaml (`gpu: 1`)

Either update these to quoted strings or confirm the parser accepts both types without issue.
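One defensive option is to normalize the GPU count at config-load time so both quoted and unquoted YAML forms are accepted. This is a sketch, not code from this repo; the function name is invented for illustration.

```python
def normalize_gpu_count(value) -> int:
    # Accept both YAML forms: `gpu: 1` (loaded as int) and `gpu: '1'` (loaded as str).
    if isinstance(value, int):
        return value
    if isinstance(value, str) and value.strip().isdigit():
        return int(value.strip())
    raise ValueError(f"Unsupported GPU count: {value!r}")

normalize_gpu_count(1)    # -> 1
normalize_gpu_count("8")  # -> 8
```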
examples/llm/README.md (1)
Lines 228-229: LGTM! Helpful documentation improvement.

The clarification about using ingress URLs directly improves the user experience by providing a more streamlined testing approach when ingress is available.
examples/multimodal/configs/agg.yaml (2)
Line 22: LGTM! Configurable prompt template implementation.

The addition of the `prompt-template` configuration enables flexible prompt formatting for multimodal interactions. The template format properly includes placeholders for image and prompt content.

Line 34: GPU resource format standardization.

The change from integer `1` to string `'1'` for GPU resources is likely required for proper YAML parsing or Kubernetes resource specifications. This standardization improves consistency across configuration files.

Also applies to: 43-43
examples/multimodal/components/decode_worker.py (1)
Lines 137-141: LGTM! Improved model loading configuration.

The updates enhance model loading with:
- `device_map="auto"` for optimal device allocation
- `torch_dtype=torch.bfloat16` for efficient precision
- `.eval()` for proper inference mode

These changes improve performance and resource utilization in remote prefill mode.
LGTM
LGTM. Thanks for the changes!
Overview:
Details:
Curl request output for agg:
Curl request output for disagg:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Documentation
Style
Refactor
Chores