feat: Support OAI frontend format and add async image handling #1214
Conversation
Force-pushed from `5f30edf` to `f6e8f83`
LGTM
Walkthrough

The changes update multimodal example usage to a chat-completion style API with structured messages containing text and image URLs. Backend components now asynchronously load and cache images, extract data from nested message content, support both streaming and aggregated responses, and remove image URLs from logs for privacy.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant Frontend
participant Processor
participant EncodeWorker
participant DecodeWorker
participant PrefillWorker
Client->>Frontend: POST /v1/chat/completions (messages: [text, image_url])
Frontend->>Processor: generate(raw_request)
Processor->>Processor: Extract text & image_url from messages
Processor->>EncodeWorker: encode(image_url)
EncodeWorker->>EncodeWorker: Async load & cache image
EncodeWorker-->>Processor: Encoded image features
Processor->>PrefillWorker: prefill(text, image features)
PrefillWorker-->>Processor: Prefilled data
Processor->>DecodeWorker: generate(prefilled data)
DecodeWorker-->>Processor: Generated response
Processor-->>Frontend: Stream or aggregate response
Frontend-->>Client: Return chat-completion response
```
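For illustration, here is a minimal client sketch matching the flow above. The endpoint path and message shape come from this PR; the host, port, prompt, image URL, and `stream`/`max_tokens` values are placeholder assumptions, and the nested `{"url": ...}` form follows the common OpenAI convention rather than this repo's exact schema.

```python
# Hypothetical client for the flow above. Endpoint path and message shape
# come from this PR; host/port, prompt, and image URL are placeholders.
import httpx

payload = {
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "http://example.com/bus.jpg"}},
            ],
        }
    ],
    "max_tokens": 300,
    "stream": False,
}

resp = httpx.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60.0)
print(resp.json()["choices"][0]["message"]["content"])
```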
📜 Recent review details

Configuration used: .coderabbit.yaml

📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (1)
Actionable comments posted: 2
🧹 Nitpick comments (3)
examples/multimodal/utils/chat_processor.py (1)
`190-236`: **Excellent implementation of non-streaming aggregation.**

The non-streaming response handling is well-implemented with proper:
- Chunk collection and aggregation
- JSON parsing with appropriate data prefix handling
- Response structure initialization following OpenAI format
- Content concatenation and finish_reason updates
The logic correctly handles the streaming chunks and builds a complete response object.
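As a reference for the pattern being praised, here is a minimal sketch of that aggregation loop; the variable names and exact chunk fields are assumptions based on the review text, not the PR's actual code. (`removeprefix` is used here to strip the SSE prefix exactly; the code under review uses `lstrip`, which strips a character set rather than a prefix.)

```python
import json

async def aggregate_stream(stream):
    """Collect streamed SSE chunks into one OpenAI-style chat completion."""
    response = None
    async for raw_response in stream:
        # Strip the SSE "data: " prefix before parsing.
        chunk = json.loads(raw_response.removeprefix("data: "))
        if response is None:
            # Initialize the aggregate response from the first chunk's metadata.
            response = {
                "id": chunk["id"],
                "object": "chat.completion",
                "created": chunk["created"],
                "model": chunk["model"],
                "choices": [
                    {
                        "index": 0,
                        "message": {"role": "assistant", "content": ""},
                        "finish_reason": None,
                    }
                ],
            }
        choice = chunk["choices"][0]
        # Concatenate delta content and carry forward the latest finish_reason.
        response["choices"][0]["message"]["content"] += choice.get("delta", {}).get("content") or ""
        if choice.get("finish_reason"):
            response["choices"][0]["finish_reason"] = choice["finish_reason"]
    return response
```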
Consider adding error handling for edge cases:

```diff
+ try:
+     response = json.loads(raw_response.lstrip("data: "))
+ except json.JSONDecodeError:
+     logger.warning(f"Failed to parse streaming response: {raw_response}")
+     continue
```

examples/multimodal/README.md (2)
`85-87`: **Add language specification to code block.**

The response example should specify the language for better documentation quality and syntax highlighting.

````diff
-```
+```json
{"id": "c37b946e-9e58-4d54-88c8-2dbd92c47b0c", "object": "chat.completion", "created": 1747725277, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " In the image, there is a city bus parked on a street, with a street sign nearby on the right side. The bus appears to be stopped out of service. The setting is in a foggy city, giving it a slightly moody atmosphere."}, "finish_reason": "stop"}]}
-```
+```
````

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
85-85: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
`152-154`: **Add language specification to code block.**

Same issue as the previous response example - should specify JSON for proper syntax highlighting.

````diff
-```
+```json
{"id": "c1774d61-3299-4aa3-bea1-a0af6c055ba8", "object": "chat.completion", "created": 1747725645, "model": "llava-hf/llava-1.5-7b-hf", "choices": [{"index": 0, "message": {"role": "assistant", "content": " This image shows a passenger bus traveling down the road near power lines and trees. The bus displays a sign that says \"OUT OF SERVICE\" on its front."}, "finish_reason": "stop"}]}
-```
+```
````

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
152-152: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- examples/multimodal/README.md (2 hunks)
- examples/multimodal/components/decode_worker.py (5 hunks)
- examples/multimodal/components/encode_worker.py (3 hunks)
- examples/multimodal/components/frontend.py (2 hunks)
- examples/multimodal/components/prefill_worker.py (1 hunks)
- examples/multimodal/components/processor.py (2 hunks)
- examples/multimodal/utils/chat_processor.py (1 hunks)
- examples/multimodal/utils/protocol.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/multimodal/components/processor.py (4)
- examples/multimodal/components/frontend.py (1): `generate` (42-48)
- examples/multimodal/components/decode_worker.py (1): `generate` (197-336)
- examples/multimodal/utils/protocol.py (1): `MultiModalRequest` (117-122)
- examples/llm/components/processor.py (2): `_generate` (189-218), `RequestType` (41-43)
🪛 markdownlint-cli2 (0.17.2)
examples/multimodal/README.md
85-85: Fenced code blocks should have a language specified (MD040, fenced-code-language)
152-152: Fenced code blocks should have a language specified (MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (24)
examples/multimodal/components/prefill_worker.py (1)
`194-195`: **LGTM: Privacy improvement in logging.**

Removing the `image_url` from the log message is a good privacy practice that reduces potential exposure of sensitive image URLs in logs while maintaining essential debugging information (request_id and engine_id).

examples/multimodal/components/decode_worker.py (6)
`200-200`: **LGTM: Consistent privacy improvement.**

Removing `image_url` from the log message improves privacy by not exposing potentially sensitive image URLs in logs.
`204-207`: **LGTM: Privacy improvement in debug logging.**

Consistent removal of `image_url` from debug log messages maintains essential debugging information while protecting sensitive data.
`225-226`: **LGTM: Privacy improvement in remote prefill logging.**

Good practice to remove `image_url` from logs while retaining key debugging information like request ID and prompt length.
`238-239`: **LGTM: Privacy improvement in local prefill logging.**

Consistent removal of `image_url` maintains debugging capability while improving privacy.
`261-263`: **LGTM: Privacy improvement in aggregated mode logging.**

Removing `image_url` from aggregated mode logs is consistent with the privacy improvements across the codebase.
`296-298`: **LGTM: Appropriate log level and privacy improvement.**

Changing from `info` to `debug` level is appropriate for detailed execution flow information, and removing `image_url` improves privacy consistency.

examples/multimodal/components/frontend.py (3)
`16-16`: **LGTM: Required import for JSON processing.**

Adding the `json` import is necessary for the response parsing functionality added below.
`41-41`: **LGTM: OpenAI-compatible endpoint naming.**

Using `"v1/chat/completions"` aligns with OpenAI's API specification for chat completion endpoints.
`48-48`: **LGTM: Proper streaming response configuration.**

Using `StreamingResponse` with the `"text/event-stream"` media type is appropriate for the server-sent events format commonly used in chat completion APIs.
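As a rough illustration of that wiring, assuming a FastAPI-style app (the stub `generate` and the route body are hypothetical, not the PR's actual frontend code):

```python
import json

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate(body):
    # Stub standing in for the processor/worker pipeline.
    for token in ["A", " bus", "."]:
        yield {"choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}]}

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()

    async def event_stream():
        # Emit each generated chunk as a server-sent event.
        async for chunk in generate(body):
            yield f"data: {json.dumps(chunk)}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```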
examples/multimodal/utils/chat_processor.py (1)

`177-189`: **LGTM: Clean streaming response handling.**

The streaming response logic preserves the existing behavior and properly delegates to the OpenAI serving layer.
examples/multimodal/README.md (2)
`58-82`: **LGTM! Clean API format update.**

The updated curl example properly demonstrates the new chat completion style API with structured messages containing mixed text and image content. This aligns well with the OpenAI format standard.
`125-149`: **LGTM! Consistent API format for disaggregated serving.**

The disaggregated serving example properly mirrors the aggregated serving format with the same structured message API.
examples/multimodal/components/processor.py (3)
`151-151`: **LGTM! Simplified streaming response handling.**

The direct yielding of responses is cleaner than manual aggregation and properly delegates streaming logic to the underlying components.
`204-212`: **Good error handling for image URL extraction.**

The iteration through messages and content items to find the image URL is well-implemented, and the ValueError for missing image URLs provides clear feedback.
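A sketch of what that extraction might look like, assuming the Pydantic message models reviewed below; the nested `image_url.url` access is an assumption about the schema:

```python
def extract_image_url(messages) -> str:
    # Walk each message's content items until an image_url entry is found.
    for message in messages:
        for item in message.content:
            if item.type == "image_url":
                return item.image_url.url
    raise ValueError("No image_url found in request messages")
```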
`213-214`: **LGTM! Clean integration with the updated API.**

The call to `_generate` with the extracted image URL and the JSON serialization of responses properly implements the new chat completion format.

examples/multimodal/utils/protocol.py (3)
`95-109`: **Excellent structured content design.**

The `TextContent` and `ImageContent` models with `Literal` types provide strong type safety and align perfectly with the OpenAI chat completion format. The union type `MessageContent` enables flexible mixed content.
`112-115`: **Good role-based message structure.**

The `ChatMessage` model properly encapsulates the role and content list, supporting the multi-modal conversation format expected by the processor.
`117-123`: **Well-designed request model refactor.**

The updated `MultiModalRequest` with a `messages` field, optional `max_tokens`, and a `stream` flag properly supports the new chat completion API while maintaining backward compatibility through sensible defaults.
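Pulling the three comments together, the models might look roughly like this. This is a sketch reconstructed from the review: the class and field names come from the comments, while the `ImageURL` wrapper and the exact defaults are assumptions.

```python
from typing import List, Literal, Optional, Union

from pydantic import BaseModel

class TextContent(BaseModel):
    type: Literal["text"] = "text"
    text: str

class ImageURL(BaseModel):
    # Hypothetical wrapper; the PR may inline the URL string instead.
    url: str

class ImageContent(BaseModel):
    type: Literal["image_url"] = "image_url"
    image_url: ImageURL

MessageContent = Union[TextContent, ImageContent]

class ChatMessage(BaseModel):
    role: str
    content: List[MessageContent]

class MultiModalRequest(BaseModel):
    messages: List[ChatMessage]
    max_tokens: Optional[int] = None  # assumed default
    stream: bool = False  # assumed default
```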
examples/multimodal/components/encode_worker.py (5)

`16-23`: **Good async-focused imports.**

The addition of `asyncio`, `base64`, `binascii`, and `httpx` properly supports the async image loading functionality. The `urlparse` import enables proper URL handling for different schemes.
`77-79`: **Proper async HTTP client setup.**

The optional HTTP client initialization and timeout configuration provide a good foundation for async operations.
`80-148`: **Excellent async image loading implementation.**

This is a well-designed async image loading method with several strengths:

- Multi-format support: Handles both HTTP(S) URLs and data URLs
- Proper caching: LRU cache with size limits and eviction
- Robust validation: Checks image formats and converts to RGB
- Thread safety: Uses `asyncio.to_thread` for PIL operations
- Comprehensive error handling: Covers HTTP errors, validation errors, and decoding issues

The implementation properly handles the async nature while maintaining image processing quality.
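To make those mechanics concrete, here is a condensed sketch of such a loader. The class name, cache size, and simple insertion-order eviction are assumptions (the review describes a true LRU); the techniques shown (httpx for HTTP(S), base64 data URLs, `asyncio.to_thread` for PIL decoding) mirror the review.

```python
import asyncio
import base64
import binascii
from io import BytesIO
from typing import Dict, Optional
from urllib.parse import urlparse

import httpx
from PIL import Image

CACHE_SIZE = 8  # assumed limit; the actual worker chooses its own

class ImageLoader:
    def __init__(self) -> None:
        self._http_client: Optional[httpx.AsyncClient] = None
        self._cache: Dict[str, Image.Image] = {}

    async def async_init(self) -> None:
        # Create the client in async context so it binds to the running loop.
        self._http_client = httpx.AsyncClient(timeout=30.0)

    async def load_image(self, image_url: str) -> Image.Image:
        if image_url in self._cache:
            return self._cache[image_url]

        scheme = urlparse(image_url).scheme
        if scheme in ("http", "https"):
            resp = await self._http_client.get(image_url)
            resp.raise_for_status()
            raw = resp.content
        elif scheme == "data":
            # Expected form: data:image/png;base64,<payload>
            try:
                raw = base64.b64decode(image_url.split(",", 1)[1])
            except (IndexError, binascii.Error) as e:
                raise ValueError(f"Invalid data URL: {e}") from e
        else:
            raise ValueError(f"Unsupported URL scheme: {scheme!r}")

        # PIL decoding is blocking; run it off the event loop.
        image = await asyncio.to_thread(
            lambda: Image.open(BytesIO(raw)).convert("RGB")
        )

        if len(self._cache) >= CACHE_SIZE:
            # Evict the oldest insertion; the PR uses a proper LRU policy.
            self._cache.pop(next(iter(self._cache)))
        self._cache[image_url] = image
        return image
```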
`165-208`: **Good error handling and logging practices.**

The try-catch block in the encode method provides proper error propagation while maintaining detailed logging. The privacy-conscious logging (omitting image URLs) is appropriate for production environments.
`209-218`: **Proper async initialization.**

Moving HTTP client initialization to `async_init` ensures proper async context and resource management.
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Refactor