[Refactor] OAI Server components #7167

JustinTong0323 · 2025-06-14T01:13:11Z

Motivation

[OAI Server Refactor] Define Initial Serving Logic Structure #7104

Modifications

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

- Introduced new modules for handling OpenAI-compatible API requests, including chat and completion serving logic.
- Implemented request validation rules for chat and completion requests.
- Added utility functions for processing and formatting requests and responses.
- Included Pydantic models for defining request and response structures.

This commit lays the groundwork for integrating OpenAI API functionalities into the SGLang framework.

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

Consolidates request handling logic into the base class to reduce code duplication. Moves the common request validation, context creation, and request dispatch logic to the OpenAIServingBase class. This change streamlines the structure of the handlers for chat completions, completions, and embeddings. The individual handler classes now only need to implement the conversion to internal format and the specific streaming and non-streaming handling logic. Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
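The consolidation described in this commit resembles a template-method layout: the base class owns validation and dispatch, and each handler supplies only conversion and the two response paths. The following is a minimal sketch under that assumption. `DemoRequest`, `ServingBaseSketch`, and `EchoServing` are hypothetical names for illustration and are not the classes added in this PR.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DemoRequest:
    """Hypothetical request type; the real handlers use Pydantic models."""

    prompt: str
    stream: bool = False


class ServingBaseSketch(ABC):
    """Illustrative stand-in for a shared serving base class."""

    async def handle_request(self, request: DemoRequest) -> Any:
        # Shared pipeline: validate, convert, then dispatch on `stream`.
        error = self._validate_request(request)
        if error is not None:
            return {"error": error}
        internal = self._convert_to_internal_request(request)
        if request.stream:
            return await self._handle_streaming_request(internal)
        return await self._handle_non_streaming_request(internal)

    def _validate_request(self, request: DemoRequest) -> Optional[str]:
        # Default rule (assumed); subclasses may override it.
        return None if request.prompt else "prompt must not be empty"

    @abstractmethod
    def _convert_to_internal_request(self, request: DemoRequest) -> Any: ...

    @abstractmethod
    async def _handle_streaming_request(self, internal: Any) -> Any: ...

    @abstractmethod
    async def _handle_non_streaming_request(self, internal: Any) -> Any: ...


class EchoServing(ServingBaseSketch):
    """Toy handler that only implements the endpoint-specific pieces."""

    def _convert_to_internal_request(self, request: DemoRequest) -> dict:
        return {"text": request.prompt}

    async def _handle_streaming_request(self, internal: dict) -> list:
        return [f"chunk:{word}" for word in internal["text"].split()]

    async def _handle_non_streaming_request(self, internal: dict) -> dict:
        return {"text": internal["text"].upper()}
```

Calling `asyncio.run(EchoServing().handle_request(DemoRequest("hi there")))` goes through the shared pipeline and ends in the non-streaming path of the subclass.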

Adds comprehensive docstrings to the OpenAI API endpoint modules, including descriptions of key features, processing pipelines, and architecture. This improves code maintainability and provides better understanding of the purpose and functionality of each module. Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>


Improves thread safety by making the chat template caching mechanism instance-specific, moving it from a global scope to the ChatCompletionHandler class. This ensures that each handler instance maintains its own cache, preventing potential conflicts when multiple instances are used concurrently. Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
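As a rough illustration of moving a cache from module scope to the instance, the sketch below keeps the cache dictionary on `self`, so two handler objects never share entries. `ChatHandlerSketch` and its detection rule are assumptions made for this example and do not reproduce the PR's handler.

```python
from typing import Dict


class ChatHandlerSketch:
    """Hypothetical handler; only the caching pattern is the point."""

    def __init__(self) -> None:
        # Per-instance cache: no module-level dict shared between handlers.
        self._template_format_cache: Dict[str, str] = {}
        self.detect_calls = 0

    def _detect_format(self, template: str) -> str:
        # Placeholder rule, assumed for illustration only.
        self.detect_calls += 1
        return "openai" if "content[" in template else "string"

    def get_template_format(self, template: str) -> str:
        # Detect once per template, then serve from this instance's cache.
        if template not in self._template_format_cache:
            self._template_format_cache[template] = self._detect_format(template)
        return self._template_format_cache[template]
```

Note that per-instance state isolates handlers from each other; concurrent use of a single instance would still need its own care.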

Refactors the logprobs processing logic into a common helper function to avoid duplication between streaming and non-streaming responses. This change improves code maintainability and reduces the risk of inconsistencies in logprobs handling. Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
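The idea of one logprobs helper shared by both response paths can be sketched as follows. The `(logprob, token_id, text)` tuple layout and all function names here are assumptions for illustration, not the helper introduced in this commit.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Assumed internal layout: (logprob, token_id, token_text).
TokenLogprob = Tuple[float, int, Optional[str]]


def format_logprobs(entries: List[TokenLogprob]) -> Dict[str, list]:
    """Single place that turns internal entries into a response-style dict."""
    return {
        "tokens": [text for _, _, text in entries],
        "token_logprobs": [logprob for logprob, _, _ in entries],
    }


def build_full_response(entries: List[TokenLogprob]) -> Dict[str, dict]:
    # Non-streaming path: format everything at once.
    return {"logprobs": format_logprobs(entries)}


def build_stream_chunks(
    entries: List[TokenLogprob], chunk_size: int
) -> Iterator[Dict[str, dict]]:
    # Streaming path: same helper, applied to each slice.
    for start in range(0, len(entries), chunk_size):
        yield {"logprobs": format_logprobs(entries[start : start + chunk_size])}
```

Because both paths call `format_logprobs`, concatenating the streamed chunks yields the same tokens as the full response.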

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>


python/sglang/srt/entrypoints/openai/serving_base.py

python/sglang/srt/entrypoints/openai/utils.py

python/sglang/srt/entrypoints/openai/validation.py

python/sglang/srt/entrypoints/openai/utils.py

python/sglang/srt/entrypoints/openai/serving_base.py


JustinTong0323 · 2025-06-15T04:07:52Z

/gemini summary

gemini-code-assist · 2025-06-15T04:08:23Z

Summary of Changes

This pull request undertakes a significant refactor of the OpenAI-compatible server components. The primary goal is to improve the code structure by introducing a base class for request handling, formalizing the API contract using Pydantic models, and separating endpoint-specific logic into dedicated handlers. This reorganization aims to improve maintainability and testability, and to provide a clearer foundation for implementing future features and adhering more closely to the OpenAI API specification.

Highlights

Architectural Refactor: Introduced a new abstract base class OpenAIServingBase (python/sglang/srt/entrypoints/openai/serving_base.py) to provide common structure and logic for handling OpenAI-compatible requests across different endpoints.
Formalized API Protocol: Defined comprehensive Pydantic models for OpenAI API requests and responses in a new protocol.py file (python/sglang/srt/entrypoints/openai/protocol.py), including SGLang-specific extensions, to formalize the API contract and enable Pydantic validation.
Dedicated Endpoint Handlers: Created separate handler classes (OpenAIServingChat, OpenAIServingCompletion, OpenAIServingEmbedding) inheriting from the base class to encapsulate endpoint-specific logic for chat completions (python/sglang/srt/entrypoints/openai/serving_chat.py), text completions (python/sglang/srt/entrypoints/openai/serving_completions.py), and embeddings (python/sglang/srt/entrypoints/openai/serving_embedding.py).
Centralized Utilities: Extracted common utility functions, such as template processing helpers, token usage calculation, and logprobs formatting, into a new utils.py module (python/sglang/srt/entrypoints/openai/utils.py).
New Unit Tests: Added extensive unit tests for the new protocol models and the refactored serving handlers (test/srt/openai/test_protocol.py, test/srt/openai/test_serving_chat.py, test/srt/openai/test_serving_completions.py, test/srt/openai/test_serving_embedding.py) to ensure correctness and maintainability.
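To make the "formalized API protocol" highlight concrete, here is a small stand-in that validates a request at construction time. It deliberately uses a stdlib dataclass with `__post_init__` instead of Pydantic so that it runs without extra dependencies; the field set, the default for `top_k`, and the validation rules are assumptions, not the models in protocol.py.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChatRequestSketch:
    """Stdlib stand-in for a Pydantic request model (hypothetical fields)."""

    model: str
    messages: List[Dict[str, str]]
    max_tokens: Optional[int] = None
    top_k: int = -1  # extension-style field; default assumed for the sketch

    def __post_init__(self) -> None:
        # Assumed rules, playing the role of field and model validators.
        if not self.messages:
            raise ValueError("messages must not be empty")
        if self.max_tokens is not None and self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")


def try_build(**kwargs) -> Optional[str]:
    """Return the validation error message, or None if the request is valid."""
    try:
        ChatRequestSketch(**kwargs)
    except ValueError as exc:
        return str(exc)
    return None
```

With this shape, an invalid request is rejected before any handler logic runs, which is the property the Pydantic models are meant to provide.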

Changelog

python/sglang/srt/entrypoints/openai/protocol.py
- New file defining Pydantic models for OpenAI API entities: ModelCard, ModelList, ErrorResponse, LogProbs, TopLogprob, ChatCompletionTokenLogprob, ChoiceLogprobs, UsageInfo, StreamOptions, JsonSchemaResponseFormat, FileRequest, FileResponse, FileDeleteResponse, BatchRequest, BatchResponse, CompletionRequest, CompletionResponseChoice, CompletionResponse, CompletionResponseStreamChoice, CompletionStreamResponse, ChatCompletionMessageContentTextPart, ChatCompletionMessageContentImageURL, ChatCompletionMessageContentAudioURL, ChatCompletionMessageContentImagePart, ChatCompletionMessageContentAudioPart, ChatCompletionMessageContentPart (Union), FunctionResponse, ToolCall, ChatCompletionMessageGenericParam, ChatCompletionMessageUserParam, ChatCompletionMessageParam (Union), ResponseFormat, StructuresResponseFormat, StructuralTagResponseFormat, Function, Tool, ToolChoiceFuncName, ToolChoice, ChatCompletionRequest, ChatMessage, MultimodalEmbeddingInput, EmbeddingInput (Union), EmbeddingRequest, EmbeddingObject, EmbeddingResponse, ScoringRequest, ScoringResponse, OpenAIServingRequest (Union).
- Includes SGLang-specific fields within standard models (e.g., top_k, min_p, regex, ebnf, repetition_penalty, stop_token_ids, no_stop_trim, ignore_eos, skip_special_tokens, lora_path, session_params, separate_reasoning, stream_reasoning, chat_template_kwargs, rid, bootstrap_* parameters).
- Adds Pydantic field and model validators for max_tokens, messages, and tool_choice.
python/sglang/srt/entrypoints/openai/serving_base.py
- New file defining OpenAIServingBase, an abstract base class for OpenAI endpoint handlers.
- Provides a common handle_request method for request validation, conversion to internal format, and dispatching to streaming/non-streaming handlers.
- Includes abstract methods _request_id_prefix, _convert_to_internal_request, _handle_streaming_request, and _handle_non_streaming_request to be implemented by subclasses.
- Implements base request ID generation (_generate_request_id_base).
- Includes a base method for calculating streaming usage (_calculate_streaming_usage_base).
- Provides helper methods for creating standard and streaming error responses (create_error_response, create_streaming_error_response).
python/sglang/srt/entrypoints/openai/serving_chat.py
- New file implementing OpenAIServingChat inheriting from OpenAIServingBase to handle /v1/chat/completions requests.
- Implements _request_id_prefix for chat completion IDs.
- Implements _validate_request with specific validation for chat messages.
- Implements _convert_to_internal_request including logic for processing messages, applying Jinja or conversation templates, handling multimodal content, and building sampling parameters (including response formats, tools, and SGLang extensions).
- Implements _handle_streaming_request for streaming chat responses, including handling delta messages, logprobs, reasoning content, and tool calls.
- Implements _handle_non_streaming_request for non-streaming chat responses, building the final response structure including logprobs, reasoning content, and tool calls.
- Includes helper methods for processing logprobs (_process_logprobs_tokens, _process_response_logprobs, _process_streaming_logprobs), tool calls (_process_tool_calls, _process_tool_call_stream), and reasoning content (_process_reasoning_stream).
python/sglang/srt/entrypoints/openai/serving_completions.py
- New file implementing OpenAIServingCompletion inheriting from OpenAIServingBase to handle /v1/completions requests.
- Implements _request_id_prefix for completion IDs.
- Implements _validate_request with specific validation for completion prompts (string, list of strings, list of token IDs, list of lists of token IDs).
- Implements _convert_to_internal_request including logic for processing prompts (with optional suffix and completion template) and building sampling parameters.
- Implements _handle_streaming_request for streaming completion responses, including handling echo and logprobs.
- Implements _handle_non_streaming_request for non-streaming completion responses, building the final response structure including echo and logprobs.
- Includes helper methods for handling echo text (_get_echo_text, _prepare_echo_prompts).
python/sglang/srt/entrypoints/openai/serving_embedding.py
- New file implementing OpenAIServingEmbedding inheriting from OpenAIServingBase to handle /v1/embeddings requests.
- Implements _request_id_prefix for embedding IDs.
- Implements _validate_request with specific validation for embedding input types.
- Implements _convert_to_internal_request including logic for processing various input types (string, list of strings, list of token IDs, list of lists of token IDs, list of MultimodalEmbeddingInput) and noting that multimodal batch requests are not yet supported.
- Implements _handle_non_streaming_request for embedding requests (streaming is not supported).
- Implements _build_embedding_response to construct the final embedding response including embedding objects and usage information.
python/sglang/srt/entrypoints/openai/utils.py
- New file containing shared utility functions.
- Includes logic for detecting Jinja template content format (detect_template_content_format) and processing message content based on format (process_content_for_template_format).
- Provides functions for calculating and aggregating token usage (calculate_token_usage, aggregate_token_usage).
- Includes a function to convert internal logprobs format to OpenAI style (to_openai_style_logprobs).
test/pytest.ini
- Added asyncio_mode = auto to the pytest configuration.
test/srt/openai/test_protocol.py
- New file with unit tests for the Pydantic models defined in protocol.py, covering creation, serialization, and validation.
test/srt/openai/test_serving_chat.py
- New file with unit tests for the OpenAIServingChat handler, covering request conversion, tool calls, multimodal content, template handling, reasoning content, sampling parameters, and compatibility with previous adapter logic.
test/srt/openai/test_serving_completions.py
- New file with unit tests for the OpenAIServingCompletion handler, covering prompt handling, echo functionality, request conversion, response building, and async method testing.
test/srt/openai/test_serving_embedding.py
- New file with unit tests for the OpenAIServingEmbedding handler, covering request validation and conversion, response building, async method testing, and edge cases like multimodal batching.
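The changelog mentions that the completion and embedding handlers accept a string, a list of strings, a list of token IDs, or a list of lists of token IDs. A minimal sketch of how such input could be classified is shown below. `classify_prompt` and its return labels are hypothetical and do not mirror the PR's `_validate_request` implementations.

```python
from typing import Any


def classify_prompt(prompt: Any) -> str:
    """Return a label for the accepted input shapes, or raise ValueError."""
    if isinstance(prompt, str):
        return "text"
    if isinstance(prompt, list) and prompt:
        if all(isinstance(item, str) for item in prompt):
            return "text_batch"
        if all(isinstance(item, int) for item in prompt):
            return "token_ids"
        if all(
            isinstance(item, list)
            and item
            and all(isinstance(token, int) for token in item)
            for item in prompt
        ):
            return "token_ids_batch"
    # Empty lists and mixed types fall through to here.
    raise ValueError("unsupported prompt type")
```

Rejecting empty and mixed inputs up front keeps the later conversion step free of type checks.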

Activity

The author @JustinTong0323 requested a review and summary from the bot.
The bot provided an initial summary and a detailed review with High and Medium priority feedback.
High priority feedback included concerns about global caching in serving_chat.py, unimplemented multimodal batching in serving_embedding.py, and incorrect top_logprobs processing in streaming chat.
Medium priority feedback covered various aspects like docstrings, broad exception handling, the inclusion of SGLang-specific extensions in standard protocol models, details of validation logic, missing text_offset in logprobs, and the use of 'padding' for multimodal text.
@JustinTong0323 responded to high-level design questions raised in the review, discussing the necessity of RequestContext, strategies for _build_sampling_params, and the placement of validation logic.
@CatherineSue provided review comments, questioning the usage of certain functions, the need for RequestContext, the placement of validation logic (suggesting moving it to individual serving classes or noting existing validation in sampling_params.py), and suggesting moving error response helpers to the base class.
@JustinTong0323 addressed some of @CatherineSue's comments, confirming some functions could be deleted and discussing validation placement.
The bot provided a second review, identifying Critical issues related to passing the entire request object instead of request.n to _calculate_streaming_usage_base in both chat and completions handlers.
The second bot review also highlighted a High priority syntax issue in the input_file_id type hint in protocol.py.
Medium priority issues from the second bot review included the presence of 'abort' and 'function_call' in finish reasons, the tool_call_constraint TODO, the 'padding' approach for multimodal text, the text_offset TODO, the source of the enable_thinking parameter, chat message validation strictness, and the warning about echo + logprobs.
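The critical issue from the second review (passing the whole request object where `request.n` was expected) is easier to see with a small example. The sketch below assumes that each prompt produces `n_choices` outputs at consecutive indices and that prompt tokens should be counted once per prompt. `streaming_usage` is a hypothetical function, not the PR's `_calculate_streaming_usage_base`.

```python
from typing import Dict


def streaming_usage(
    prompt_tokens: Dict[int, int],
    completion_tokens: Dict[int, int],
    n_choices: int,
) -> Dict[str, int]:
    """Sum usage across choice indices, counting each prompt only once."""
    # Indices 0..n-1 belong to the first prompt, n..2n-1 to the second, etc.
    total_prompt = sum(
        tokens for index, tokens in prompt_tokens.items() if index % n_choices == 0
    )
    total_completion = sum(completion_tokens.values())
    return {
        "prompt_tokens": total_prompt,
        "completion_tokens": total_completion,
        "total_tokens": total_prompt + total_completion,
    }
```

Passing an arbitrary object instead of an integer for `n_choices` makes the modulo fail with a `TypeError`, which is the kind of mistake the review flagged.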

CatherineSue

This is in good shape. Once the comments on the tests are resolved, we can run the unit tests locally, make sure they all pass, and merge this.

Great work!

python/sglang/srt/entrypoints/openai/serving_base.py

test/srt/openai/test_serving_chat.py

test/srt/openai/test_serving_completions.py

test/srt/openai/test_serving_chat.py

test/srt/openai/test_serving_embedding.py

python/sglang/srt/entrypoints/openai/serving_chat.py

python/sglang/srt/entrypoints/openai/serving_completions.py

Updates the error response creation methods to return ORJSONResponse instead of ErrorResponse. This change enhances the response formatting for error handling, ensuring consistency and improved performance in API responses. Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

Co-authored-by: Chang Su <csu272@usc.edu>

Updates test files to rename the handler fixtures to `serving_*` for better clarity and consistency across the tests. This change improves the readability of the tests and makes it easier to understand which object is being tested in each test case. Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

Passes the tool call constraint to the sampling parameters and incorporates tool call constraint handling in sampling parameter building. This allows the model to respect constraints specified for tool calls during the sampling process. Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>


JustinTong0323 · 2025-06-16T09:22:21Z

All related tests passed:

CatherineSue

Approved. Looks like lint is failing. Let's fix the lint.


jhinpan · 2025-06-17T04:29:25Z

I'm wondering whether we need to add the test files here to the CI suite as well, such as test_serving_chat.py and test_serving_completions.py.


(sgl-project#7222) * chore: upgrade sgl-kernel v0.2.3 (sgl-project#7786) * Fix incorrect spec_num_draft_tokens in draft_extend (sgl-project#7757) * [fix] fix misusing of is_cuda (sgl-project#7790) * Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (sgl-project#7756) Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com> * chore: bump sgl-kernel v0.2.4 (sgl-project#7800) * ci: fix port args (sgl-project#7792) * Fix CI test OOM issue. (sgl-project#7799) * chore: upgrade sgl-kernel v0.2.4 (sgl-project#7801) * chore: bump v0.4.9 (sgl-project#7802) * fix merge conflict issue * fix hpu attention nonetyep issue * fix alignment * fix alignment2 * Ci failure fixes * fix attention-backend choices --------- Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com> Signed-off-by: huanglong <huanglong@linux.alibaba.com> Signed-off-by: Ata Fatahi <immrata@gmail.com> Signed-off-by: keru <rukeyang@gmail.com> Signed-off-by: Tianyu Zhou <albert.zty@antgroup.com> Signed-off-by: rudeigerc <rudeigerc@gmail.com> Signed-off-by: Kay Yan <kay.yan@daocloud.io> Signed-off-by: Capronir <839972205@qq.com> Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com> Signed-off-by: Mohit Sinha <msinha@habana.ai> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: KavioYu <67678385+yukavio@users.noreply.github.com> Co-authored-by: kavioyu <kavioyu@tencent.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com> Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com> Co-authored-by: wunhuang <wunhuang@amd.com> Co-authored-by: DiweiSun <105627594+DiweiSun@users.noreply.github.com> Co-authored-by: u4lr451 <u4lr451@gmail.com> Co-authored-by: austindeng <austindeng@tencent.com> Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com> 
Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: ch-wan <cwan39@gatech.edu> Co-authored-by: Yijie Zhu <762412795@qq.com> Co-authored-by: 刁莹煜 <diaoyingyu1@hisilicon.com> Co-authored-by: Charles Chen <pychen96@gmail.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: YanbingJiang <yanbing.jiang@intel.com> Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com> Co-authored-by: jianan-gu <jianan.gu@intel.com> Co-authored-by: sdp <sdp@gnr799219.jf.intel.com> Co-authored-by: Binyao Jiang <byjiang1996@gmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: linzhuo <15313137931lz@gmail.com> Co-authored-by: ch-tiger1 <tiger@ch-tech.ip-ddns.com> Co-authored-by: ch-tiger1 <xyz@ch-tech.ip-ddns.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: Simo Lin <linsimo.mark@gmail.com> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com> Co-authored-by: Atream <80757050+Atream@users.noreply.github.com> Co-authored-by: Li Hui <lambert80.ios@gmail.com> Co-authored-by: Huang Long <121648372+LLLL114@users.noreply.github.com> Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com> Co-authored-by: Ata Fatahi <immrata@gmail.com> Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com> Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> Co-authored-by: Wenbo Yang <solrex@users.noreply.github.com> Co-authored-by: Chang Su <csu272@usc.edu> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Keyang Ru 
<rukeyang@gmail.com> Co-authored-by: ehuaa <ehuamail@163.com> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Lifu Huang <lifu.hlf@gmail.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: alcanderian <alcanderian@gmail.com> Co-authored-by: Ke Bao <ISPObaoke@163.com> Co-authored-by: Sai Enduri <saimanas.enduri@amd.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: xutizhou <xutingz@nvidia.com> Co-authored-by: TianQiLin666666 <1834987979@qq.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: Yuhong Guo <guoyuhong1985@outlook.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: Alex Sun <alex.s@amd.com> Co-authored-by: valarLip <103567126+valarLip@users.noreply.github.com> Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: xianzhiT <xianzhitang@tencent.com> Co-authored-by: yilian49 <43861414+yilian49@users.noreply.github.com> Co-authored-by: DangKai <dangkai4u@outlook.com> Co-authored-by: dangkai.dk <dangkai.dk@alibaba-inc.com> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: ll819214 <18801269230@163.com> Co-authored-by: Li Junwen <lijunwen13@hisilicon.com> Co-authored-by: zixuanzhang226 <zixuanzhang@bytedance.com> Co-authored-by: Hongbo Xu <1320612015@qq.com> Co-authored-by: shangmingc <csmthu@gmail.com> Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu> Co-authored-by: Meng, Peng <pengmeng@tencent.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: tarinkk 
<129432511+tarinkk@users.noreply.github.com> Co-authored-by: tarinkk <rt572@physics.rutger.edu> Co-authored-by: tarinkk <rt572@rutgers.physics.edu> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com> Co-authored-by: Sheng Qi <shengqi2018@pku.edu.cn> Co-authored-by: finetune <82650881+finetunej@users.noreply.github.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: Kan Wu <wukanustc@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: narutolhy <582909902@qq.com> Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Shenggui Li <somerlee.9@gmail.com> Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: Simon_CQK <cqk0100@gmail.com> Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: yych0745 <1398089567@qq.com> Co-authored-by: HandH1998 <1335248067@qq.com> Co-authored-by: 弋云 <yiyun.wyt@antgroup.com> Co-authored-by: walker-ai <2398833647@qq.com> Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> Co-authored-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com> Co-authored-by: Albert <albert.zty@antgroup.com> Co-authored-by: Ziming Huang <1520787127@qq.com> Co-authored-by: ayrnb <70835312+ayrnb@users.noreply.github.com> Co-authored-by: HydraQYH <QYH820@Outlook.com> Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: Maksim <makcum888e@mail.ru> Co-authored-by: VDV1985 <vladdv85@mail.ru> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: TianyuZhang1214 <tianyuzhang1214@163.com> Co-authored-by: alpha-baby <fujianhao1997@qq.com> Co-authored-by: Yuchen Cheng <rudeigerc@gmail.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: Caproni <40862361+Capronir@users.noreply.github.com> 
Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com> Co-authored-by: 纬杭 <ximing.wxm@antgroup.com> Co-authored-by: zyksir <zyksir@outlook.com> Co-authored-by: SijiaYang <yangsijia.614@bytedance.com> Co-authored-by: yicwang <yichen.wang@bytedance.com> Co-authored-by: Leng Yue <lengyue@lengyue.me> Co-authored-by: Qi Yuhang <45795032+HydraQYH@users.noreply.github.com> Co-authored-by: Gang Chen <13298548+MoonBall@users.noreply.github.com> Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com> Co-authored-by: jay <jthakur@habana.ai>

This comment was marked as outdated.

JustinTong0323 force-pushed the refactor_oai_server_serving branch from a35b162 to 1d17465 on June 14, 2025 01:20

JustinTong0323 and others added 5 commits June 13, 2025 18:21

Merge branch 'main' into refactor_oai_server_serving

42bb560

feat: add serving_embedding

d9ceddd

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

Simplifies getting enable_thinking value

5ddc8fc

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

JustinTong0323 marked this pull request as ready for review June 14, 2025 02:53

JustinTong0323 requested review from merrymercy, Ying1123, zhyncs and zhaochenyang20 as code owners June 14, 2025 02:53

This comment was marked as outdated.

JustinTong0323 and others added 5 commits June 14, 2025 02:59

rename serving_engine to serving_base

2ddbb40

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

Merge branch 'main' into refactor_oai_server_serving

26771ad

Update python/sglang/srt/entrypoints/openai/protocol.py

8ac4349

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

This comment was marked as outdated.

merrymercy and others added 2 commits June 14, 2025 06:07

Improve test cases for eagle infer (sgl-project#7173)

00b202c

fix CI

fb4ae05

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

JustinTong0323 force-pushed the refactor_oai_server_serving branch from 524d276 to fb4ae05 on June 14, 2025 06:09

Merge branch 'main' into refactor_oai_server_serving

81f5e41

JustinTong0323 requested a review from CatherineSue June 14, 2025 06:13

Merge branch 'main' into refactor_oai_server_serving

2a10db7

CatherineSue reviewed Jun 14, 2025

This comment was marked as resolved.

Removes unused utility functions

3b28fdb

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

JustinTong0323 force-pushed the refactor_oai_server_serving branch from 4d770f2 to d433e43 on June 15, 2025 04:06

CatherineSue reviewed Jun 15, 2025

CatherineSue mentioned this pull request Jun 15, 2025

OAI Server Skeleton & Core Utility Endpoints #7179

Merged

16 tasks

CatherineSue reviewed Jun 15, 2025

python/sglang/srt/entrypoints/openai/serving_chat.py

python/sglang/srt/entrypoints/openai/serving_chat.py (outdated)

CatherineSue reviewed Jun 16, 2025

python/sglang/srt/entrypoints/openai/serving_completions.py

CatherineSue mentioned this pull request Jun 16, 2025

[OAI Server Refactor] Implement Embeddings, Scoring, and Rerank #7107

Closed

JustinTong0323 and others added 11 commits June 16, 2025 08:28

Apply suggestions from code review

48586bf

Co-authored-by: Chang Su <csu272@usc.edu>

move the text = content["text"] in serving_chat for Better readability

69e41f7

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

lint

590db9a

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

remove redundant logic

4c140c8

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

logic for generate_completion_prompt

7190e6f

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

Add comments back

40e97fc

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

Merge branch 'main' into refactor_oai_server_serving

84f6037

fix tests

b95a288

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

CatherineSue approved these changes Jun 16, 2025

fix lint

cc28f37

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

CatherineSue mentioned this pull request Jun 16, 2025

[OAI Server Refactor] Define Initial Serving Logic Structure #7104

Closed

Merge branch 'main' into refactor_oai_server_serving

ea30a8c

zhyncs merged commit 70c471a into sgl-project:main Jun 17, 2025
0 of 47 checks passed

coco-alen pushed a commit to jinleic/sglang that referenced this pull request Jun 20, 2025

[Refactor] OAI Server components (sgl-project#7167)

e96a8af

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>

JustinTong0323 deleted the refactor_oai_server_serving branch July 18, 2025 23:22

JustinTong0323 commented Jun 14, 2025 • edited

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as resolved.

JustinTong0323 commented Jun 15, 2025

gemini-code-assist bot commented Jun 15, 2025

Summary of Changes

Highlights

CatherineSue left a comment

JustinTong0323 commented Jun 16, 2025

CatherineSue left a comment

jhinpan commented Jun 17, 2025
