Description
1. Overview and Motivation
The current SGLang OpenAI-compatible API is integrated within the monolithic `http_server.py`. This design mixes native SGLang endpoints with OpenAI-compatible endpoints, making it difficult to maintain, extend, and debug. High request concurrency has also revealed potential latency bottlenecks within the `openai_api/adapter.py` layer.
The goal of this project is to refactor the OpenAI-compatible API into a new, self-contained, and modular server. This will improve maintainability, extensibility, and performance, drawing inspiration from the successful modular design of vLLM's OpenAI API server.
2. Proposed Design
We will create a new, dedicated module for the OpenAI-compatible server with a clear, extensible structure.
2.1. New Directory Structure
The new module will be located at `sglang/python/sglang/srt/entrypoints/openai/`:
```
sglang/
└── python/
    └── sglang/
        └── srt/
            ├── entrypoints/
            │   ├── http_server.py             # Existing native server (to be cleaned up)
            │   └── openai/                    # New module for OpenAI server
            │       ├── __init__.py
            │       ├── api_server.py          # New OpenAI API server entrypoint
            │       ├── protocol.py            # OpenAI request/response models (moved)
            │       ├── utils.py               # Utilities (moved)
            │       ├── serving_chat.py        # Logic for /v1/chat/completions
            │       ├── serving_completion.py  # Logic for /v1/completions
            │       ├── serving_embedding.py   # Logic for /v1/embeddings
            │       └── ...                    # Other serving modules as needed
            │
            └── openai_api/                    # Existing module (to be deprecated)
                ├── adapter.py                 # To be refactored and eventually deprecated
                └── ...
```
2.2. Components
- `api_server.py`: The main entrypoint for the new server. It will be a lightweight FastAPI application that initializes the SGLang engine via a `lifespan` context manager and mounts the various OpenAI endpoints from the serving modules. A minimal sketch of this entrypoint follows this list.
- `serving_*.py` files: Each file will encapsulate the logic for a specific group of OpenAI API endpoints (e.g., `serving_chat.py` for chat completions). Common patterns and reusable logic may be refactored into a shared base class or utility module within the `entrypoints/openai/` directory to promote consistency and reduce code duplication as development progresses.
- `protocol.py`: This file will continue to be the definitive source for the server's external API contract, containing the Pydantic models for all OpenAI API data structures, including SGLang-specific extensions.
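To make this structure concrete, here is a minimal sketch of what `api_server.py` could look like. The `ServingChat` stub and the engine wiring are illustrative assumptions, not the final interface:

```python
# Minimal sketch of api_server.py; ServingChat and the engine wiring are
# illustrative placeholders, not the final interface.
from contextlib import asynccontextmanager

from fastapi import FastAPI, Request


class ServingChat:
    """Hypothetical stand-in for the future serving_chat module."""

    def __init__(self, engine):
        self.engine = engine

    async def handle(self, raw_request: Request):
        # Validate the request, call the engine, and format an
        # OpenAI-compatible response (to be implemented).
        return {"object": "chat.completion"}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize the SGLang engine once and share it via app.state.
    app.state.engine = None  # placeholder for SGLang engine initialization
    app.state.serving_chat = ServingChat(app.state.engine)
    yield
    # Shutdown: release engine resources here if needed.


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    return {"status": "ok"}


@app.post("/v1/chat/completions")
async def chat_completions(raw_request: Request):
    # Delegate to the serving module mounted at startup.
    return await raw_request.app.state.serving_chat.handle(raw_request)
```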
2.3. API Endpoints
The new server will implement the following endpoints to achieve parity with the existing OpenAI-compatible API (a client-side usage example follows the list).
- Core Endpoints:
  - `GET /health`: Basic health check.
  - `POST /health_generate`: Health check that confirms model generation.
  - `GET /v1/models`: Lists the available models.
  - `POST /v1/chat/completions`: Main endpoint for chat-based generation.
  - `POST /v1/completions`: Main endpoint for text completion.
  - `POST /v1/embeddings`: Endpoint for generating embeddings.
  - `POST /v1/score`: Custom endpoint for scoring requests.
- File API Endpoints:
  - `POST /v1/files` (create)
  - `GET /v1/files/{file_id}` (retrieve)
  - `DELETE /v1/files/{file_id}` (delete)
  - `GET /v1/files/{file_id}/content` (retrieve content)
- Batch API Endpoints:
  - `POST /v1/batches` (create)
  - `GET /v1/batches/{batch_id}` (retrieve)
  - `POST /v1/batches/{batch_id}/cancel` (cancel)
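Once the server is running, these endpoints should be reachable with the official `openai` client; the base URL, API key, and model name below are placeholders:

```python
# Example client call against the new server; base_url, api_key, and the
# model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```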
2.4. Handling Dependencies and Complex Features
To ensure both API flexibility and behavioral compatibility, the project will adopt a phased approach to dependencies:
- API Contract (`protocol.py`): This file will define the external API contract using SGLang's own Pydantic models, allowing for custom extensions. The class names (`ChatCompletionRequest`, etc.) will remain the same; a sketch follows this list.
- Internal Processing: The `openai` Python package will be introduced as a runtime dependency only when implementing features that require complex, standardized processing (e.g., tool calls). The internal logic (e.g., in `serving_chat.py`) will then use the official types from the `openai` package to ensure behavioral alignment with OpenAI's specification.
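As a rough illustration of how a `protocol.py` model can carry SGLang extensions alongside the standard OpenAI fields (the extension fields shown here are examples, not the final schema):

```python
# Illustrative sketch of a protocol.py model; the SGLang-specific extension
# fields (top_k, regex) are examples, not the final schema.
from typing import Dict, List, Optional

from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # Standard OpenAI fields (subset shown).
    model: str
    messages: List[Dict[str, str]]
    temperature: float = 1.0
    max_tokens: Optional[int] = None
    stream: bool = False

    # SGLang-specific extensions (illustrative).
    top_k: Optional[int] = None
    regex: Optional[str] = None
```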
3. Profiling Existing Latency Issues
3.1. Problem Statement
High P99 latency has been observed when making requests through the OpenAI-compatible API path (`http_server.py` -> `adapter.py`) under high concurrency, compared to the native `/generate` endpoint. The goal is to identify the bottleneck within the `adapter.py` layer.
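One lightweight way to start localizing the bottleneck is a timing middleware that logs per-path request latency, so the OpenAI-compatible path can be compared against `/generate` under identical load. This is one possible approach, not a prescribed tool:

```python
# Sketch: per-path latency logging to compare the OpenAI-compatible path
# against /generate under load. One possible approach, not a prescribed tool.
import time

from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def time_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Log timings keyed by path so per-endpoint P99 can be computed offline.
    print(f"{request.url.path} took {elapsed_ms:.1f} ms")
    return response
```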
4. Phased Implementation Timeline
An accelerated three-week timeline is proposed, with specific, verifiable tasks for each phase.
Week 1: Foundational Server Setup
Goal: Establish a functional, standalone API server with core health, model, and metrics endpoints.
- Task 1: Initialize Server Structure
  - Create the new directory structure (`sglang/python/sglang/srt/entrypoints/openai/`).
  - Create a skeleton `api_server.py` with a FastAPI app instance.
  - Move `protocol.py` and `utils.py` from the old `openai_api` directory to the new one.
- Task 2: Implement Core Utility Endpoints
  - In `api_server.py`, implement the `/health`, `/health_generate`, and `/v1/models` endpoints (a sketch of `/v1/models` follows this list).
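A sketch of the `/v1/models` handler, assuming the served model name is stored on `app.state` at startup; the response shape follows the OpenAI model list format:

```python
# Sketch of /v1/models; assumes the served model name is stored on
# app.state at startup. The response shape follows the OpenAI list format.
import time

from fastapi import FastAPI, Request

app = FastAPI()  # standalone here for illustration


@app.get("/v1/models")
async def list_models(raw_request: Request):
    model_id = getattr(raw_request.app.state, "model_path", "default")
    return {
        "object": "list",
        "data": [
            {
                "id": model_id,
                "object": "model",
                "created": int(time.time()),
                "owned_by": "sglang",
            }
        ],
    }
```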
- Task 3: Implement Engine Lifecycle and Metrics
  - Implement the `lifespan` context manager in `api_server.py`.
  - The `lifespan` startup logic will be responsible for initializing the SGLang engine (placeholder for now).
  - Optionally, unconditionally call `enable_func_timer()` from `sglang.srt.metrics.func_timer` and set up `add_prometheus_middleware(app)` within the `lifespan` startup (sketched below). This will enable metrics globally and remove the need for `if enable_metrics:` checks throughout the codebase. This can be a follow-up after all tasks are done.
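A sketch of the optional metrics wiring inside `lifespan`. `enable_func_timer` comes from `sglang.srt.metrics.func_timer` as noted above; the `add_prometheus_middleware` helper is stubbed here because its import location is not pinned down in this proposal:

```python
# Sketch of unconditional metrics setup in lifespan. enable_func_timer is
# named in the task above; add_prometheus_middleware is stubbed because its
# import location is not specified here.
from contextlib import asynccontextmanager

from fastapi import FastAPI

from sglang.srt.metrics.func_timer import enable_func_timer


def add_prometheus_middleware(app: FastAPI) -> None:
    """Placeholder: the real helper should mount a /metrics endpoint."""


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Enable function-level timing metrics unconditionally at startup,
    # removing the need for scattered `if enable_metrics:` checks.
    enable_func_timer()
    add_prometheus_middleware(app)
    yield


app = FastAPI(lifespan=lifespan)
```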
- Task 4: Define Initial Serving Logic Structure
  - Anticipating the development of the `serving_*.py` modules in Week 2, define a preliminary structure for handling common request/response logic.
  - This may involve outlining a base class (e.g., `OpenAIServingBase`, sketched below) or a set of shared utility functions.
  - Key considerations: request validation, interaction with the SGLang engine (to be passed from `api_server.py`), response formatting, and error handling.
  - This task is foundational for ensuring consistency across different OpenAI endpoints and will be iteratively refined as serving modules are implemented.
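One possible shape for the shared base class; the method names and signatures here are illustrative only and expected to evolve as the serving modules land:

```python
# Illustrative outline of a shared serving base class; method names and
# signatures are expected to evolve.
from fastapi import Request
from fastapi.responses import JSONResponse


class OpenAIServingBase:
    """Common request/response plumbing shared by serving_*.py modules."""

    def __init__(self, engine):
        # The engine handle is passed in from api_server.py at startup.
        self.engine = engine

    async def handle_request(self, raw_request: Request):
        request = await self._parse_and_validate(raw_request)
        try:
            return await self._generate(request)
        except ValueError as e:
            return self.create_error_response(str(e))

    async def _parse_and_validate(self, raw_request: Request):
        raise NotImplementedError  # each endpoint validates its own model

    async def _generate(self, request):
        raise NotImplementedError  # endpoint-specific engine interaction

    def create_error_response(self, message: str, status_code: int = 400):
        # OpenAI-style error envelope.
        return JSONResponse(
            status_code=status_code,
            content={"error": {"message": message, "type": "invalid_request_error"}},
        )
```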
Week 2: Core Endpoints
Goal: Implement the primary OpenAI-compatible generation endpoints by refactoring logic from `adapter.py`.
- Task 5: Implement Chat Completions
  - Create `serving_chat.py`.
  - Refactor the logic for `/v1/chat/completions` from `adapter.py`, including tool call support (a streaming sketch follows this list).
  - Mount the endpoint in `api_server.py`.
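For the streaming path, the handler will need to emit server-sent events terminated by `data: [DONE]`, per the OpenAI wire format. A compressed sketch, with the chunk construction from engine output elided:

```python
# Compressed sketch of the streaming branch in serving_chat.py; chunk
# construction from engine output is elided.
import json

from fastapi.responses import StreamingResponse


async def stream_chat(engine_stream):
    async def event_generator():
        async for chunk in engine_stream:
            # chunk: an OpenAI ChatCompletionChunk-shaped dict built from
            # incremental engine output (construction elided).
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"  # OpenAI stream terminator

    return StreamingResponse(event_generator(), media_type="text/event-stream")
```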
- Task 6: Implement Embeddings & Scoring
  - Create `serving_embedding.py` and `serving_score.py`.
  - Refactor the logic for `/v1/embeddings` and `/v1/score` from `adapter.py`.
  - Mount the endpoints in `api_server.py`.
- Task 7: Implement Text Completions
  - Create `serving_completion.py`.
  - Refactor the logic for `/v1/completions` from `adapter.py`.
  - Mount the endpoint in `api_server.py`.
Week 3: Stateful Endpoints (Files & Batch API)
Goal: Implement the more complex, stateful endpoints for file and batch processing.
- Task 8: Implement Files API (see the comment "7. Batch API support" below)
  - Create `serving_file.py`.
  - Refactor the logic for all `/v1/files` endpoints.
  - Mount the new router in `api_server.py`.
- Task 9: Implement Batch API
  - Create `serving_batch.py`.
  - Refactor the logic for all `/v1/batches` endpoints.
  - Mount the new router in `api_server.py`.
Week 1 & 2: Parallel Testing Strategy
To ensure the refactored server maintains full API compatibility and avoids regressions, testing will be conducted in parallel with development. We will not modify the existing tests; instead, we will replicate their logic to run against our new server.
- Task 1: Create New Test Directory
  - A new directory will be created at `sglang/test/srt/openai/` to house all unit and integration tests for the new API server. This keeps the new test suite isolated from the legacy tests.
- Task 2: Implement a New Test Harness
  - A new `pytest` fixture will be created (e.g., in `sglang/test/srt/openai/conftest.py`); a sketch follows this list.
  - This fixture will be responsible for starting the new `api_server.py` in a background process, managing its configuration, and ensuring it is ready before tests run.
  - It will mirror the functionality of the existing `popen_launch_server` helper but will be tailored to our new server's entrypoint and arguments.
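A sketch of what the harness fixture could look like; the launch command, port, and readiness probe are placeholders modeled on `popen_launch_server`:

```python
# Sketch of a conftest.py harness fixture; the launch command, port, and
# readiness probe are placeholders modeled on popen_launch_server.
import subprocess
import sys
import time

import pytest
import requests

BASE_URL = "http://127.0.0.1:30000"


@pytest.fixture(scope="session")
def openai_server():
    proc = subprocess.Popen(
        [sys.executable, "-m", "sglang.srt.entrypoints.openai.api_server"],
    )
    try:
        # Poll /health until the server is ready (or time out).
        for _ in range(120):
            try:
                if requests.get(f"{BASE_URL}/health", timeout=1).status_code == 200:
                    break
            except requests.RequestException:
                pass
            time.sleep(1)
        else:
            raise RuntimeError("server failed to become healthy")
        yield BASE_URL
    finally:
        proc.terminate()
        proc.wait()
```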
- Task 3: Adapt and Validate Existing Tests
  - As each endpoint (e.g., Chat Completions) is implemented in the new server, the corresponding legacy test file (e.g., `test_openai_function_calling.py`) will be copied into the new `sglang/test/srt/openai/` directory.
  - The copied test will be adapted to use the new test harness fixture instead of the old one (an example follows this list).
  - The core test logic (API request payloads and response assertions) will be kept identical.
  - This will allow us to run the same tests against both the old and new servers, providing a direct and reliable way to verify that our refactored implementation is correct.
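An adapted test would then differ from its legacy counterpart only in the fixture it consumes; the model name and assertion below are illustrative:

```python
# Example adapted test: payload and assertions mirror the legacy test;
# only the server fixture changes. The model name is illustrative.
import openai


def test_chat_completion_basic(openai_server):
    client = openai.OpenAI(base_url=f"{openai_server}/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": "Say hello."}],
    )
    assert resp.choices[0].message.content
```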
Post-Refactor Tasks
- Hardening: Finalize command-line argument parsing using the existing `server_args.py`.
- Deprecation: Once the new server is stable and fully validated by the adapted tests, plan the formal deprecation and removal of the OpenAI-compatible endpoints from `http_server.py`.
- Support for Responses API: Implement the OpenAI Responses API for more advanced interaction patterns. Reference: https://platform.openai.com/docs/api-reference/responses
  - `POST /v1/responses` (create)
  - `GET /v1/responses/{response_id}`
  - `GET /v1/responses/{response_id}/input_items`