Description
1. Overview and Motivation
The current SGLang OpenAI-compatible API is integrated within the monolithic `http_server.py`. This design mixes native SGLang endpoints with OpenAI-compatible endpoints, making it difficult to maintain, extend, and debug. High request concurrency has also revealed potential latency bottlenecks within the `openai_api/adapter.py` layer.
The goal of this project is to refactor the OpenAI-compatible API into a new, self-contained, and modular server. This will improve maintainability, extensibility, and performance, drawing inspiration from the successful modular design of vLLM's OpenAI API server.
2. Proposed Design
We will create a new, dedicated module for the OpenAI-compatible server with a clear, extensible structure.
2.1. New Directory Structure
The new module will be located at `sglang/python/sglang/srt/entrypoints/openai/`:
```
sglang/
└── python/
    └── sglang/
        └── srt/
            ├── entrypoints/
            │   ├── http_server.py             # Existing native server (to be cleaned up)
            │   └── openai/                    # New module for OpenAI server
            │       ├── __init__.py
            │       ├── api_server.py          # New OpenAI API server entrypoint
            │       ├── protocol.py            # OpenAI request/response models (moved)
            │       ├── utils.py               # Utilities (moved)
            │       ├── serving_chat.py        # Logic for /v1/chat/completions
            │       ├── serving_completion.py  # Logic for /v1/completions
            │       ├── serving_embedding.py   # Logic for /v1/embeddings
            │       └── ...                    # Other serving modules as needed
            │
            └── openai_api/                    # Existing module (to be deprecated)
                ├── adapter.py                 # To be refactored and eventually deprecated
                └── ...
```
2.2. Components
- `api_server.py`: The main entrypoint for the new server. It will be a lightweight FastAPI application that initializes the SGLang engine via a `lifespan` context manager and mounts the various OpenAI endpoints from the serving modules. A minimal sketch of this entrypoint follows this list.
- `serving_*.py` files: Each file will encapsulate the logic for a specific group of OpenAI API endpoints (e.g., `serving_chat.py` for chat completions). Common patterns and reusable logic may be refactored into a shared base class or utility module within the `entrypoints/openai/` directory to promote consistency and reduce code duplication as development progresses.
- `protocol.py`: This file will continue to be the definitive source for the server's external API contract, containing the Pydantic models for all OpenAI API data structures, including SGLang-specific extensions.
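To make this structure concrete, here is a minimal sketch of what `api_server.py` could look like. The `ServingChat` stub and the engine wiring are illustrative assumptions, not the final interface:

```python
# Minimal sketch of api_server.py; ServingChat and the engine wiring are
# illustrative placeholders, not the final interface.
from contextlib import asynccontextmanager

from fastapi import FastAPI, Request


class ServingChat:
    """Hypothetical stand-in for the future serving_chat module."""

    def __init__(self, engine):
        self.engine = engine

    async def handle(self, raw_request: Request):
        # Validate the request, call the engine, and format an
        # OpenAI-compatible response (to be implemented).
        return {"object": "chat.completion"}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize the SGLang engine once and share it via app.state.
    app.state.engine = None  # placeholder for SGLang engine initialization
    app.state.serving_chat = ServingChat(app.state.engine)
    yield
    # Shutdown: release engine resources here if needed.


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    return {"status": "ok"}


@app.post("/v1/chat/completions")
async def chat_completions(raw_request: Request):
    # Delegate to the serving module mounted at startup.
    return await raw_request.app.state.serving_chat.handle(raw_request)
```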
2.3. API Endpoints
The new server will implement the following endpoints to achieve parity with the existing OpenAI-compatible API (a client-side usage example follows the list).
- Core Endpoints:
  - `GET /health`: Basic health check.
  - `POST /health_generate`: Health check that confirms model generation.
  - `GET /v1/models`: Lists the available models.
  - `POST /v1/chat/completions`: Main endpoint for chat-based generation.
  - `POST /v1/completions`: Main endpoint for text completion.
  - `POST /v1/embeddings`: Endpoint for generating embeddings.
  - `POST /v1/score`: Custom endpoint for scoring requests.
- File API Endpoints:
  - `POST /v1/files` (create)
  - `GET /v1/files/{file_id}` (retrieve)
  - `DELETE /v1/files/{file_id}` (delete)
  - `GET /v1/files/{file_id}/content` (retrieve content)
- Batch API Endpoints:
  - `POST /v1/batches` (create)
  - `GET /v1/batches/{batch_id}` (retrieve)
  - `POST /v1/batches/{batch_id}/cancel` (cancel)
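Once the server is running, these endpoints should be reachable with the official `openai` client; the base URL, API key, and model name below are placeholders:

```python
# Example client call against the new server; base_url, api_key, and the
# model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```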
2.4. Handling Dependencies and Complex Features
To ensure both API flexibility and behavioral compatibility, the project will adopt a phased approach to dependencies:
- API Contract (`protocol.py`): This file will define the external API contract using SGLang's own Pydantic models, allowing for custom extensions. The class names (`ChatCompletionRequest`, etc.) will remain the same; a sketch follows this list.
- Internal Processing: The `openai` Python package will be introduced as a runtime dependency only when implementing features that require complex, standardized processing (e.g., tool calls). The internal logic (e.g., in `serving_chat.py`) will then use the official types from the `openai` package to ensure behavioral alignment with OpenAI's specification.
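As a rough illustration of how a `protocol.py` model can carry SGLang extensions alongside the standard OpenAI fields (the extension fields shown here are examples, not the final schema):

```python
# Illustrative sketch of a protocol.py model; the SGLang-specific extension
# fields (top_k, regex) are examples, not the final schema.
from typing import Dict, List, Optional

from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # Standard OpenAI fields (subset shown).
    model: str
    messages: List[Dict[str, str]]
    temperature: float = 1.0
    max_tokens: Optional[int] = None
    stream: bool = False

    # SGLang-specific extensions (illustrative).
    top_k: Optional[int] = None
    regex: Optional[str] = None
```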
3. Profiling Existing Latency Issues
3.1. Problem Statement
High P99 latency has been observed when making requests through the OpenAI-compatible API path (`http_server.py` -> `adapter.py`) under high concurrency, compared to the native `/generate` endpoint. The goal is to identify the bottleneck within the `adapter.py` layer.
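One lightweight way to start localizing the bottleneck is a timing middleware that logs per-path request latency, so the OpenAI-compatible path can be compared against `/generate` under identical load. This is one possible approach, not a prescribed tool:

```python
# Sketch: per-path latency logging to compare the OpenAI-compatible path
# against /generate under load. One possible approach, not a prescribed tool.
import time

from fastapi import FastAPI, Request

app = FastAPI()


@app.middleware("http")
async def time_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Log timings keyed by path so per-endpoint P99 can be computed offline.
    print(f"{request.url.path} took {elapsed_ms:.1f} ms")
    return response
```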
4. Phased Implementation Timeline
An accelerated three-week timeline is proposed, with specific, verifiable tasks for each phase.
Week 1: Foundational Server Setup
Goal: Establish a functional, standalone API server with core health, model, and metrics endpoints.
- Task 1: Initialize Server Structure
  - Create the new directory structure (`sglang/python/sglang/srt/entrypoints/openai/`).
  - Create a skeleton `api_server.py` with a FastAPI app instance.
  - Move `protocol.py` and `utils.py` from the old `openai_api` directory to the new one.
- Task 2: Implement Core Utility Endpoints
  - In `api_server.py`, implement the `/health`, `/health_generate`, and `/v1/models` endpoints (a sketch of `/v1/models` follows this list).
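A sketch of the `/v1/models` handler, assuming the served model name is stored on `app.state` at startup; the response shape follows the OpenAI model list format:

```python
# Sketch of /v1/models; assumes the served model name is stored on
# app.state at startup. The response shape follows the OpenAI list format.
import time

from fastapi import FastAPI, Request

app = FastAPI()  # standalone here for illustration


@app.get("/v1/models")
async def list_models(raw_request: Request):
    model_id = getattr(raw_request.app.state, "model_path", "default")
    return {
        "object": "list",
        "data": [
            {
                "id": model_id,
                "object": "model",
                "created": int(time.time()),
                "owned_by": "sglang",
            }
        ],
    }
```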
- Task 3: Implement Engine Lifecycle and Metrics
  - Implement the `lifespan` context manager in `api_server.py`.
  - The `lifespan` startup logic will be responsible for initializing the SGLang engine (placeholder for now).
  - Optionally, unconditionally call `enable_func_timer()` from `sglang.srt.metrics.func_timer` and set up `add_prometheus_middleware(app)` within the `lifespan` startup (sketched below). This will enable metrics globally and remove the need for `if enable_metrics:` checks throughout the codebase. This can be a follow-up after all tasks are done.
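A sketch of the optional metrics wiring inside `lifespan`. `enable_func_timer` comes from `sglang.srt.metrics.func_timer` as noted above; the `add_prometheus_middleware` helper is stubbed here because its import location is not pinned down in this proposal:

```python
# Sketch of unconditional metrics setup in lifespan. enable_func_timer is
# named in the task above; add_prometheus_middleware is stubbed because its
# import location is not specified here.
from contextlib import asynccontextmanager

from fastapi import FastAPI

from sglang.srt.metrics.func_timer import enable_func_timer


def add_prometheus_middleware(app: FastAPI) -> None:
    """Placeholder: the real helper should mount a /metrics endpoint."""


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Enable function-level timing metrics unconditionally at startup,
    # removing the need for scattered `if enable_metrics:` checks.
    enable_func_timer()
    add_prometheus_middleware(app)
    yield


app = FastAPI(lifespan=lifespan)
```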
- Task 4: Define Initial Serving Logic Structure
  - Anticipating the development of the `serving_*.py` modules in Week 2, define a preliminary structure for handling common request/response logic.
  - This may involve outlining a base class (e.g., `OpenAIServingBase`, sketched below) or a set of shared utility functions.
  - Key considerations: request validation, interaction with the SGLang engine (to be passed from `api_server.py`), response formatting, and error handling.
  - This task is foundational for ensuring consistency across different OpenAI endpoints and will be iteratively refined as serving modules are implemented.
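One possible shape for the shared base class; the method names and signatures here are illustrative only and expected to evolve as the serving modules land:

```python
# Illustrative outline of a shared serving base class; method names and
# signatures are expected to evolve.
from fastapi import Request
from fastapi.responses import JSONResponse


class OpenAIServingBase:
    """Common request/response plumbing shared by serving_*.py modules."""

    def __init__(self, engine):
        # The engine handle is passed in from api_server.py at startup.
        self.engine = engine

    async def handle_request(self, raw_request: Request):
        request = await self._parse_and_validate(raw_request)
        try:
            return await self._generate(request)
        except ValueError as e:
            return self.create_error_response(str(e))

    async def _parse_and_validate(self, raw_request: Request):
        raise NotImplementedError  # each endpoint validates its own model

    async def _generate(self, request):
        raise NotImplementedError  # endpoint-specific engine interaction

    def create_error_response(self, message: str, status_code: int = 400):
        # OpenAI-style error envelope.
        return JSONResponse(
            status_code=status_code,
            content={"error": {"message": message, "type": "invalid_request_error"}},
        )
```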
Week 2: Core Endpoints
Goal: Implement the primary OpenAI-compatible generation endpoints by refactoring logic from `adapter.py`.
- Task 5: Implement Chat Completions
  - Create `serving_chat.py`.
  - Refactor the logic for `/v1/chat/completions` from `adapter.py`, including tool call support (a streaming sketch follows this list).
  - Mount the endpoint in `api_server.py`.
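For the streaming path, the handler will need to emit server-sent events terminated by `data: [DONE]`, per the OpenAI wire format. A compressed sketch, with the chunk construction from engine output elided:

```python
# Compressed sketch of the streaming branch in serving_chat.py; chunk
# construction from engine output is elided.
import json

from fastapi.responses import StreamingResponse


async def stream_chat(engine_stream):
    async def event_generator():
        async for chunk in engine_stream:
            # chunk: an OpenAI ChatCompletionChunk-shaped dict built from
            # incremental engine output (construction elided).
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"  # OpenAI stream terminator

    return StreamingResponse(event_generator(), media_type="text/event-stream")
```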
- Task 6: Implement Embeddings & Scoring
  - Create `serving_embedding.py` and `serving_score.py`.
  - Refactor the logic for `/v1/embeddings` and `/v1/score` from `adapter.py`.
  - Mount the endpoints in `api_server.py`.
- Task 7: Implement Text Completions
  - Create `serving_completion.py`.
  - Refactor the logic for `/v1/completions` from `adapter.py`.
  - Mount the endpoint in `api_server.py`.
Week 3: Stateful Endpoints (Files & Batch API)
Goal: Implement the more complex, stateful endpoints for file and batch processing.
- Task 8: Implement Files API (see the comment "7. Batch API support" below)
  - Create `serving_file.py`.
  - Refactor the logic for all `/v1/files` endpoints.
  - Mount the new router in `api_server.py`.
- Task 9: Implement Batch API
  - Create `serving_batch.py`.
  - Refactor the logic for all `/v1/batches` endpoints.
  - Mount the new router in `api_server.py`.
Week 1 & 2: Parallel Testing Strategy
To ensure the refactored server maintains full API compatibility and avoids regressions, testing will be conducted in parallel with development. We will not modify the existing tests; instead, we will replicate their logic to run against our new server.
- Task 1: Create New Test Directory
  - A new directory will be created at `sglang/test/srt/openai/` to house all unit and integration tests for the new API server. This keeps the new test suite isolated from the legacy tests.
- Task 2: Implement a New Test Harness
  - A new `pytest` fixture will be created (e.g., in `sglang/test/srt/openai/conftest.py`); a sketch follows this list.
  - This fixture will be responsible for starting the new `api_server.py` in a background process, managing its configuration, and ensuring it is ready before tests run.
  - It will mirror the functionality of the existing `popen_launch_server` helper but will be tailored to our new server's entrypoint and arguments.
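A sketch of what the harness fixture could look like; the launch command, port, and readiness probe are placeholders modeled on `popen_launch_server`:

```python
# Sketch of a conftest.py harness fixture; the launch command, port, and
# readiness probe are placeholders modeled on popen_launch_server.
import subprocess
import sys
import time

import pytest
import requests

BASE_URL = "http://127.0.0.1:30000"


@pytest.fixture(scope="session")
def openai_server():
    proc = subprocess.Popen(
        [sys.executable, "-m", "sglang.srt.entrypoints.openai.api_server"],
    )
    try:
        # Poll /health until the server is ready (or time out).
        for _ in range(120):
            try:
                if requests.get(f"{BASE_URL}/health", timeout=1).status_code == 200:
                    break
            except requests.RequestException:
                pass
            time.sleep(1)
        else:
            raise RuntimeError("server failed to become healthy")
        yield BASE_URL
    finally:
        proc.terminate()
        proc.wait()
```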
- Task 3: Adapt and Validate Existing Tests
  - As each endpoint (e.g., Chat Completions) is implemented in the new server, the corresponding legacy test file (e.g., `test_openai_function_calling.py`) will be copied into the new `sglang/test/srt/openai/` directory.
  - The copied test will be adapted to use the new test harness fixture instead of the old one (an example follows this list).
  - The core test logic (API request payloads and response assertions) will be kept identical.
  - This will allow us to run the same tests against both the old and new servers, providing a direct and reliable way to verify that our refactored implementation is correct.
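An adapted test would then differ from its legacy counterpart only in the fixture it consumes; the model name and assertion below are illustrative:

```python
# Example adapted test: payload and assertions mirror the legacy test;
# only the server fixture changes. The model name is illustrative.
import openai


def test_chat_completion_basic(openai_server):
    client = openai.OpenAI(base_url=f"{openai_server}/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": "Say hello."}],
    )
    assert resp.choices[0].message.content
```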
Post-Refactor Tasks
- Hardening: Finalize command-line argument parsing using the existing `server_args.py`.
- Deprecation: Once the new server is stable and fully validated by the adapted tests, plan the formal deprecation and removal of the OpenAI-compatible endpoints from `http_server.py`.
- Support for Responses API: Implement the OpenAI Responses API for more advanced interaction patterns. Reference: https://platform.openai.com/docs/api-reference/responses
  - `POST /v1/responses` (create)
  - `GET /v1/responses/{response_id}`
  - `GET /v1/responses/{response_id}/input_items`