Description
Points: 2-3 days
Description: Unify logprobs handling and `UsageInfo` in `v1/chat/completions` and `v1/completions`. Reduce repetitive code and improve code reuse and structure.
Deliverables:
- Complete Task 1 and Task 2
- Add UTs
Task 1: Token Logprobs Handling
Current logic in `adapter.py#L1327-L1368` (`sglang/python/sglang/srt/openai_api/adapter.py`, lines 1327 to 1368 at `ca92911`):

```python
logprobs = False
if isinstance(request, list) and request[idx].logprobs:
    logprobs = True
elif (not isinstance(request, list)) and request.logprobs:
    logprobs = True
if logprobs:
    logprobs = to_openai_style_logprobs(
        output_token_logprobs=ret_item["meta_info"]["output_token_logprobs"],
        output_top_logprobs=ret_item["meta_info"].get(
            "output_top_logprobs", None
        ),
    )
    token_logprobs = []
    for token_idx, (token, logprob) in enumerate(
        zip(logprobs.tokens, logprobs.token_logprobs)
    ):
        token_bytes = list(token.encode("utf-8"))
        top_logprobs = []
        if logprobs.top_logprobs:
            for top_token, top_logprob in logprobs.top_logprobs[
                token_idx
            ].items():
                top_token_bytes = list(top_token.encode("utf-8"))
                top_logprobs.append(
                    TopLogprob(
                        token=top_token,
                        bytes=top_token_bytes,
                        logprob=top_logprob,
                    )
                )
        token_logprobs.append(
            ChatCompletionTokenLogprob(
                token=token,
                bytes=token_bytes,
                logprob=logprob,
                top_logprobs=top_logprobs,
            )
        )
    choice_logprobs = ChoiceLogprobs(content=token_logprobs)
else:
    choice_logprobs = None
```
New logic in `serving_chat.py` (`sglang/python/sglang/srt/entrypoints/openai/serving_chat.py`, lines 786 to 794 at `70c471a`):

```python
def _process_response_logprobs(self, ret_item: Dict[str, Any]) -> ChoiceLogprobs:
    """Process logprobs for non-streaming response"""
    logprobs = to_openai_style_logprobs(
        output_token_logprobs=ret_item["meta_info"]["output_token_logprobs"],
        output_top_logprobs=ret_item["meta_info"].get("output_top_logprobs", None),
    )
    token_logprobs = self._process_logprobs_tokens(logprobs, use_token_index=True)
    return ChoiceLogprobs(content=token_logprobs)
```
For non-streaming responses, logprob handling first calls `_process_response_logprobs`, which in turn calls `_process_logprobs_tokens`.
Unify Logprobs
The logic itself is fine, but it is convoluted, entangled with the streaming logprobs path and the completions endpoint:
- Inconsistent entry points: chat has two different methods (`_process_response_logprobs` vs `_process_streaming_logprobs`) for similar work
- Duplicated logic: both chat and completions call `to_openai_style_logprobs`
- Mixed responsibilities: some methods do conversion plus processing, others just processing
- Hard to test: complex call chains make unit testing difficult
Design
-
Approach: Create a unified
LogProbsProcessor
using factory pattern to eliminate code duplication and inconsistent APIs. -
New File:
sglang/python/sglang/srt/entrypoints/openai/logprobs_processor.py
-
High Level Design:
serving_chat.py
: Replace_process_streaming_logprobs
and_process_response_logprobs
with factory calls, remove_process_logprobs_tokens
serving_completions.py
: Replace inlineto_openai_style_logprobs
calls with factory methodsutils.py
: Deprecate or removeto_openai_style_logprobs
function
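A minimal sketch of what the unified processor could look like. All names besides `LogProbsProcessor`, `TopLogprob`, and `ChatCompletionTokenLogprob` (e.g. `convert`, `for_chat`, `for_completions`, and their signatures) are assumptions for illustration, not the final API; the real implementation would reuse sglang's existing protocol dataclasses:

```python
# Hypothetical sketch: one conversion path shared by the chat and completions
# endpoints, for both streaming and non-streaming responses.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TopLogprob:  # stand-in for the existing protocol class
    token: str
    bytes: List[int]
    logprob: float


@dataclass
class ChatCompletionTokenLogprob:  # stand-in for the existing protocol class
    token: str
    bytes: List[int]
    logprob: float
    top_logprobs: List[TopLogprob]


class LogProbsProcessor:
    """Single home for OpenAI-style logprob conversion."""

    @staticmethod
    def convert(
        tokens: List[str],
        token_logprobs: List[float],
        top_logprobs: Optional[List[Dict[str, float]]] = None,
    ) -> List[ChatCompletionTokenLogprob]:
        out = []
        for idx, (token, logprob) in enumerate(zip(tokens, token_logprobs)):
            tops = []
            if top_logprobs:
                # Each position carries a {token: logprob} map of alternatives.
                for top_token, top_lp in top_logprobs[idx].items():
                    tops.append(
                        TopLogprob(
                            token=top_token,
                            bytes=list(top_token.encode("utf-8")),
                            logprob=top_lp,
                        )
                    )
            out.append(
                ChatCompletionTokenLogprob(
                    token=token,
                    bytes=list(token.encode("utf-8")),
                    logprob=logprob,
                    top_logprobs=tops,
                )
            )
        return out

    # Factory entry points: both endpoints delegate to the same core
    # conversion, so streaming/non-streaming chat and completions no longer
    # duplicate the token/bytes/top-logprob assembly.
    @classmethod
    def for_chat(cls, tokens, token_logprobs, top_logprobs=None):
        return cls.convert(tokens, token_logprobs, top_logprobs)

    @classmethod
    def for_completions(cls, tokens, token_logprobs, top_logprobs=None):
        return cls.convert(tokens, token_logprobs, top_logprobs)
```

Because the core `convert` is a pure function of parallel lists, it is directly unit-testable without mocking a serving handler, which addresses the "hard to test" point above.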
Task 2: UsageInfo
Current Problem
- Code duplication: `aggregate_token_usage` (utils.py) vs `_calculate_streaming_usage_base` (serving_base.py)
- Different data formats: non-streaming uses response lists, streaming uses token dictionaries
- Similar logic: both calculate total tokens with n_choices handling and cache reporting
Design Recommendation
- Approach: create a unified `UsageProcessor` following the same factory pattern as LogProbs.
- New file: `sglang/python/sglang/srt/entrypoints/openai/usage_processor.py`
- Files to update:
  - `serving_chat.py`: replace `aggregate_token_usage` calls with factory methods
  - `serving_completions.py`: replace `aggregate_token_usage` calls with factory methods
  - `serving_base.py`: replace `_calculate_streaming_usage_base` with factory calls
  - `utils.py`: deprecate the `aggregate_token_usage` function
- Functions to consolidate:
  - `aggregate_token_usage` (from utils.py) → `UsageProcessor.calculate_response_usage`
  - `_calculate_streaming_usage_base` (from serving_base.py) → `UsageProcessor.calculate_streaming_usage`
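The consolidation above could be sketched roughly as follows. The method names `calculate_response_usage` and `calculate_streaming_usage` come from the plan; everything else (the `UsageInfo` stand-in, parameter names, and the exact `meta_info` keys) is an assumption for illustration:

```python
# Hypothetical sketch: both usage calculations in one class, one for
# non-streaming response lists and one for streaming per-index token dicts.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class UsageInfo:  # stand-in for the existing protocol class
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    prompt_tokens_details: Optional[Dict[str, int]] = None


class UsageProcessor:
    """Single home for token-usage math."""

    @staticmethod
    def calculate_response_usage(
        responses: List[Dict[str, Any]],
        n_choices: int = 1,
        enable_cache_report: bool = False,
    ) -> UsageInfo:
        # Non-streaming: with n choices per request, only every n-th item
        # should contribute prompt tokens, so they are not counted n times.
        prompt = sum(r["meta_info"]["prompt_tokens"] for r in responses[::n_choices])
        completion = sum(r["meta_info"]["completion_tokens"] for r in responses)
        details = None
        if enable_cache_report:
            cached = sum(
                r["meta_info"].get("cached_tokens", 0) for r in responses[::n_choices]
            )
            details = {"cached_tokens": cached}
        return UsageInfo(prompt, completion, prompt + completion, details)

    @staticmethod
    def calculate_streaming_usage(
        prompt_tokens: Dict[int, int],
        completion_tokens: Dict[int, int],
        cached_tokens: Dict[int, int],
        n_choices: int = 1,
        enable_cache_report: bool = False,
    ) -> UsageInfo:
        # Streaming: counts accumulate per choice index across chunks;
        # prompt tokens are again counted once per request, not per choice.
        prompt = sum(v for i, v in prompt_tokens.items() if i % n_choices == 0)
        completion = sum(completion_tokens.values())
        details = None
        if enable_cache_report:
            details = {"cached_tokens": sum(cached_tokens.values())}
        return UsageInfo(prompt, completion, prompt + completion, details)
```

Keeping both entry points as static methods on one class makes the shared n_choices and cache-report handling obvious, and gives the deliverable UTs two small pure functions to assert against.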