[OAI Server Refactor] [ChatCompletions & Completions] Implement UsageInfo and LogProbs Processor #7259

@CatherineSue

Description

Points: 2-3 days

Description: Unify logprobs handling and UsageInfo in v1/chat/completions and v1/completions. Reduce repetitive code, and improve code reusability and structure.

Deliverables:

  • Complete Task 1 and Task 2
  • Add UTs

Task 1: Token Logprobs Handling

Current logic in adapter.py#L1327-L1368:

logprobs = False
if isinstance(request, list) and request[idx].logprobs:
    logprobs = True
elif (not isinstance(request, list)) and request.logprobs:
    logprobs = True
if logprobs:
    logprobs = to_openai_style_logprobs(
        output_token_logprobs=ret_item["meta_info"]["output_token_logprobs"],
        output_top_logprobs=ret_item["meta_info"].get(
            "output_top_logprobs", None
        ),
    )
    token_logprobs = []
    for token_idx, (token, logprob) in enumerate(
        zip(logprobs.tokens, logprobs.token_logprobs)
    ):
        token_bytes = list(token.encode("utf-8"))
        top_logprobs = []
        if logprobs.top_logprobs:
            for top_token, top_logprob in logprobs.top_logprobs[
                token_idx
            ].items():
                top_token_bytes = list(top_token.encode("utf-8"))
                top_logprobs.append(
                    TopLogprob(
                        token=top_token,
                        bytes=top_token_bytes,
                        logprob=top_logprob,
                    )
                )
        token_logprobs.append(
            ChatCompletionTokenLogprob(
                token=token,
                bytes=token_bytes,
                logprob=logprob,
                top_logprobs=top_logprobs,
            )
        )
    choice_logprobs = ChoiceLogprobs(content=token_logprobs)
else:
    choice_logprobs = None

New logic in serving_chat.py:

def _process_response_logprobs(self, ret_item: Dict[str, Any]) -> ChoiceLogprobs:
    """Process logprobs for non-streaming response"""
    logprobs = to_openai_style_logprobs(
        output_token_logprobs=ret_item["meta_info"]["output_token_logprobs"],
        output_top_logprobs=ret_item["meta_info"].get("output_top_logprobs", None),
    )
    token_logprobs = self._process_logprobs_tokens(logprobs, use_token_index=True)
    return ChoiceLogprobs(content=token_logprobs)

For non-streaming responses, the handler first calls _process_response_logprobs, which then delegates to _process_logprobs_tokens.
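The delegated helper is not shown in the issue; below is a minimal, self-contained sketch of what _process_logprobs_tokens plausibly does, with stand-in dataclasses in place of sglang's protocol models and a hypothetical use_token_index flag that distinguishes per-response indexing from streaming chunks (both names are assumptions):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Stand-in types mirroring sglang's OpenAI protocol models
# (assumptions for illustration, not the real sglang classes).
@dataclass
class TopLogprob:
    token: str
    bytes: List[int]
    logprob: float

@dataclass
class ChatCompletionTokenLogprob:
    token: str
    bytes: List[int]
    logprob: float
    top_logprobs: List[TopLogprob] = field(default_factory=list)

@dataclass
class LogProbs:
    tokens: List[str]
    token_logprobs: List[float]
    top_logprobs: Optional[List[Dict[str, float]]] = None

def process_logprobs_tokens(
    logprobs: LogProbs, use_token_index: bool = True
) -> List[ChatCompletionTokenLogprob]:
    """Convert OpenAI-style logprobs into ChatCompletionTokenLogprob entries."""
    token_logprobs = []
    for token_idx, (token, logprob) in enumerate(
        zip(logprobs.tokens, logprobs.token_logprobs)
    ):
        top = []
        if logprobs.top_logprobs:
            # Hypothetical flag: streaming chunks carry a single
            # top_logprobs dict, so index 0 replaces token_idx there.
            idx = token_idx if use_token_index else 0
            for top_token, top_logprob in logprobs.top_logprobs[idx].items():
                top.append(
                    TopLogprob(
                        token=top_token,
                        bytes=list(top_token.encode("utf-8")),
                        logprob=top_logprob,
                    )
                )
        token_logprobs.append(
            ChatCompletionTokenLogprob(
                token=token,
                bytes=list(token.encode("utf-8")),
                logprob=logprob,
                top_logprobs=top,
            )
        )
    return token_logprobs
```

This keeps the per-token loop from the adapter.py block in one place so both the streaming and non-streaming entry points can share it.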

Unify Logprobs

  • Convoluted Flow: the logic itself is fine, but it is entangled with the streaming logprobs path and the completions endpoint
  • Inconsistent Entry Points: Chat has 2 different methods (_process_response_logprobs vs _process_streaming_logprobs) for similar work
  • Duplicated Logic: Both chat and completions call to_openai_style_logprobs
  • Mixed Responsibilities: Some methods do conversion + processing, others just processing
  • Hard to Test: Complex call chains make unit testing difficult

Design

  • Approach: Create a unified LogProbsProcessor using factory pattern to eliminate code duplication and inconsistent APIs.

  • New File: sglang/python/sglang/srt/entrypoints/openai/logprobs_processor.py

  • High Level Design:

    • serving_chat.py: Replace _process_streaming_logprobs and _process_response_logprobs with factory calls, remove _process_logprobs_tokens
    • serving_completions.py: Replace inline to_openai_style_logprobs calls with factory methods
    • utils.py: Deprecate or remove to_openai_style_logprobs function

Task 2: UsageInfo

Current Problem

  • Code Duplication: aggregate_token_usage (utils.py) vs _calculate_streaming_usage_base (serving_base.py)
  • Different Data Formats: Non-streaming uses response lists, streaming uses token dictionaries
  • Similar Logic: Both calculate total tokens with n_choices handling and cache reporting

Design Recommendation

  • Approach: Create unified UsageProcessor following same factory pattern as LogProbs.

  • New File: sglang/python/sglang/srt/entrypoints/openai/usage_processor.py

  • Files to Update:

    • serving_chat.py: Replace aggregate_token_usage calls with factory methods
    • serving_completions.py: Replace aggregate_token_usage calls with factory methods
    • serving_base.py: Replace _calculate_streaming_usage_base with factory calls
    • utils.py: Deprecate aggregate_token_usage function
  • Functions to Consolidate:

    • aggregate_token_usage (from utils.py) → UsageProcessor.calculate_response_usage
    • _calculate_streaming_usage_base (from serving_base.py) → UsageProcessor.calculate_streaming_usage
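The consolidation above can be sketched as follows; the UsageProcessor method signatures and the meta_info field names are assumptions based on the two functions being merged, not the final implementation:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class UsageInfo:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    prompt_tokens_details: Optional[Dict[str, int]] = None

class UsageProcessor:
    @staticmethod
    def calculate_response_usage(
        responses: List[dict], n_choices: int = 1, enable_cache_report: bool = False
    ) -> UsageInfo:
        # Non-streaming (was aggregate_token_usage): sum over the response
        # list, counting each prompt once per request (every n_choices-th item).
        prompt = sum(r["meta_info"]["prompt_tokens"] for r in responses[::n_choices])
        completion = sum(r["meta_info"]["completion_tokens"] for r in responses)
        details = None
        if enable_cache_report:
            cached = sum(r["meta_info"].get("cached_tokens", 0) for r in responses)
            details = {"cached_tokens": cached}
        return UsageInfo(prompt, completion, prompt + completion, details)

    @staticmethod
    def calculate_streaming_usage(
        prompt_tokens: Dict[int, int],
        completion_tokens: Dict[int, int],
        cached_tokens: Dict[int, int],
        n_choices: int = 1,
        enable_cache_report: bool = False,
    ) -> UsageInfo:
        # Streaming (was _calculate_streaming_usage_base): counts arrive as
        # dicts keyed by choice index rather than as a response list.
        prompt = sum(v for i, v in prompt_tokens.items() if i % n_choices == 0)
        completion = sum(completion_tokens.values())
        details = None
        if enable_cache_report:
            details = {"cached_tokens": sum(cached_tokens.values())}
        return UsageInfo(prompt, completion, prompt + completion, details)
```

Both methods return the same UsageInfo shape, so the two data formats (response lists vs. token dictionaries) stay at the edges while the n_choices and cache-report handling lives in one module.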
