
[Feature] [Roadmap] OpenAI-Compatible Server Refactor #7068

@CatherineSue

Description

1. Overview and Motivation

The current SGLang OpenAI-compatible API is integrated within the monolithic http_server.py. This design mixes native SGLang endpoints with OpenAI-compatible endpoints, making it difficult to maintain, extend, and debug. High request concurrency has also revealed potential latency bottlenecks within the openai_api/adapter.py layer.

The goal of this project is to refactor the OpenAI-compatible API into a new, self-contained, and modular server. This will improve maintainability, extensibility, and performance, drawing inspiration from the successful modular design of vLLM's OpenAI API server.

2. Proposed Design

We will create a new, dedicated module for the OpenAI-compatible server with a clear, extensible structure.

2.1. New Directory Structure

The new module will be located at sglang/python/sglang/srt/entrypoints/openai/:

sglang/
└── python/
    └── sglang/
        └── srt/
            ├── entrypoints/
            │   ├── http_server.py         # Existing native server (to be cleaned up)
            │   └── openai/                # New module for OpenAI server
            │       ├── __init__.py
            │       ├── api_server.py      # New OpenAI API server entrypoint
            │       ├── protocol.py        # OpenAI request/response models (moved)
            │       ├── utils.py           # Utilities (moved)
            │       ├── serving_chat.py    # Logic for /v1/chat/completions
            │       ├── serving_completion.py # Logic for /v1/completions
            │       ├── serving_embedding.py # Logic for /v1/embeddings
            │       └── ...                # Other serving modules as needed
            │
            └── openai_api/                # Existing module (to be deprecated)
                ├── adapter.py             # To be refactored and eventually deprecated
                └── ...

2.2. Components

  • api_server.py: The main entrypoint for the new server. It will be a lightweight FastAPI application that initializes the SGLang engine via a lifespan context manager and mounts the various OpenAI endpoints from the serving modules.
  • serving_*.py files: Each file will encapsulate the logic for a specific group of OpenAI API endpoints (e.g., serving_chat.py for chat completions). Common patterns and reusable logic may be refactored into a shared base class or utility module within the entrypoints/openai/ directory to promote consistency and reduce code duplication as development progresses.
  • protocol.py: This file will continue to be the definitive source for the server's external API contract, containing the Pydantic models for all OpenAI API data structures, including SGLang-specific extensions.

2.3. API Endpoints

The new server will implement the following endpoints to achieve parity with the existing OpenAI-compatible API.

  • Core Endpoints:
    • GET /health: Basic health check.
    • POST /health_generate: Health check that confirms model generation.
    • GET /v1/models: Lists the available models.
    • POST /v1/chat/completions: Main endpoint for chat-based generation.
    • POST /v1/completions: Main endpoint for text completion.
    • POST /v1/embeddings: Endpoint for generating embeddings.
    • POST /v1/score: Custom endpoint for scoring requests.
  • File API Endpoints:
    • POST /v1/files (create)
    • GET /v1/files/{file_id} (retrieve)
    • DELETE /v1/files/{file_id} (delete)
    • GET /v1/files/{file_id}/content (retrieve content)
  • Batch API Endpoints:
    • POST /v1/batches (create)
    • GET /v1/batches/{batch_id} (retrieve)
    • POST /v1/batches/{batch_id}/cancel (cancel)

2.4. Handling Dependencies and Complex Features

To ensure both API flexibility and behavioral compatibility, the project will adopt a phased approach to dependencies:

  • API Contract (protocol.py): This file will define the external API contract using SGLang's own Pydantic models, allowing for custom extensions. The class names (ChatCompletionRequest, etc.) will remain the same.
  • Internal Processing: The openai Python package will be introduced as a runtime dependency only when implementing features that require complex, standardized processing (e.g., Tool Calls). The internal logic (e.g., in serving_chat.py) will then use the official types from the openai package to ensure behavioral alignment with OpenAI's specification.
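
To make the contract concrete, a heavily abbreviated sketch of what the protocol.py models might look like is shown below. The field sets are incomplete, and the `regex` field is only an example of the kind of SGLang-specific extension the models can carry:

```python
# Illustrative sketch of protocol.py Pydantic models (abbreviated).
from typing import List, Optional

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: str
    content: str


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: float = 1.0
    max_tokens: Optional[int] = None
    # Example of an SGLang-specific extension field:
    # constrained decoding via a regular expression.
    regex: Optional[str] = None
```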

3. Profiling Existing Latency Issues

3.1. Problem Statement

High P99 latency has been observed when making requests through the OpenAI-compatible API path (http_server.py -> adapter.py) under high concurrency, compared to the native /generate endpoint. The goal is to identify the bottleneck within the adapter.py layer.

4. Phased Implementation Timeline

An accelerated three-week timeline is proposed, with specific, verifiable tasks for each phase.

Week 1: Foundational Server Setup

Goal: Establish a functional, standalone API server with core health, model, and metrics endpoints.

  • Task 1: Initialize Server Structure

    • Create the new directory structure (sglang/python/sglang/srt/entrypoints/openai/).
    • Create a skeleton api_server.py with a FastAPI app instance.
    • Move protocol.py and utils.py from the old openai_api directory to the new one.
  • Task 2: Implement Core Utility Endpoints

    • In api_server.py, implement the /health, /health_generate, and /v1/models endpoints.
  • Task 3: Implement Engine Lifecycle and Metrics

    • Implement the lifespan context manager in api_server.py.
    • The lifespan startup logic will be responsible for initializing the SGLang engine (placeholder for now).
    • Optionally, call enable_func_timer() from sglang.srt.metrics.func_timer unconditionally and set up add_prometheus_middleware(app) within the lifespan startup. This will enable metrics globally and remove the need for scattered if enable_metrics: checks throughout the codebase. This can be a follow-up after all other tasks are done.
  • Task 4: Define Initial Serving Logic Structure

    • Anticipating the development of serving_*.py modules in Week 2, define a preliminary structure for handling common request/response logic.
    • This may involve outlining a base class (e.g., OpenAIServingBase) or a set of shared utility functions.
    • Key considerations: request validation, interaction with the SGLang engine (to be passed from api_server.py), response formatting, and error handling.
    • This task is foundational for ensuring consistency across different OpenAI endpoints and will be iteratively refined as serving modules are implemented.
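
The base class outlined in Task 4 might look roughly like the sketch below. The method names and the engine interface are assumptions that would be refined as the serving modules are implemented:

```python
# Hypothetical sketch of a shared OpenAIServingBase (Task 4).
from abc import ABC, abstractmethod


class OpenAIServingBase(ABC):
    """Common request/response plumbing shared by serving_*.py modules."""

    def __init__(self, engine):
        # The engine handle is passed down from api_server.py.
        self.engine = engine

    async def handle_request(self, request):
        # Validate -> convert -> generate -> format, with shared error handling.
        error = self.validate_request(request)
        if error is not None:
            return self.create_error_response(error)
        internal = self.convert_to_internal_request(request)
        result = await self.engine.generate(internal)
        return self.build_response(request, result)

    def validate_request(self, request):
        # Subclasses add endpoint-specific validation; None means valid.
        return None

    def create_error_response(self, message, status_code=400):
        return {"error": {"message": message, "code": status_code}}

    @abstractmethod
    def convert_to_internal_request(self, request): ...

    @abstractmethod
    def build_response(self, request, result): ...
```

Each serving module would subclass this, overriding only the conversion and formatting steps specific to its endpoint.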

Week 2: Core Endpoints

Goal: Implement the primary OpenAI-compatible generation endpoints by refactoring logic from adapter.py.

  • Task 5: Implement Chat Completions

    • Create serving_chat.py.
    • Refactor the logic for /v1/chat/completions from adapter.py, including tool call support.
    • Mount the endpoint in api_server.py.
  • Task 6: Implement Embeddings & Scoring

    • Create serving_embedding.py and serving_score.py.
    • Refactor the logic for /v1/embeddings and /v1/score from adapter.py.
    • Mount the endpoints in api_server.py.
  • Task 7: Implement Text Completions

    • Create serving_completion.py.
    • Refactor the logic for /v1/completions from adapter.py.
    • Mount the endpoint in api_server.py.

Week 3: Stateful Endpoints (Files & Batch API)

Goal: Implement the more complex, stateful endpoints for file and batch processing.

  • Task 8: Implement Files API (see comment "7. Batch API support" below)

    • Create serving_file.py.
    • Refactor the logic for all /v1/files endpoints.
    • Mount the new router in api_server.py.
  • Task 9: Implement Batch API

    • Create serving_batch.py.
    • Refactor the logic for all /v1/batches endpoints.
    • Mount the new router in api_server.py.

Weeks 1 & 2: Parallel Testing Strategy

To ensure the refactored server maintains full API compatibility and avoids regressions, testing will be conducted in parallel with development. We will not modify the existing tests; instead, we will replicate their logic to run against our new server.

  • Task 1: Create New Test Directory

    • A new directory will be created at sglang/test/srt/openai/ to house all unit and integration tests for the new API server. This keeps the new test suite isolated from the legacy tests.
  • Task 2: Implement a New Test Harness

    • A new pytest fixture will be created (e.g., in sglang/test/srt/openai/conftest.py).
    • This fixture will be responsible for starting the new api_server.py in a background process, managing its configuration, and ensuring it is ready before tests run.
    • It will mirror the functionality of the existing popen_launch_server helper but will be tailored to our new server's entrypoint and arguments.
  • Task 3: Adapt and Validate Existing Tests

    • As each endpoint (e.g., Chat Completions) is implemented in the new server, the corresponding legacy test file (e.g., test_openai_function_calling.py) will be copied into the new sglang/test/srt/openai/ directory.
    • The copied test will be adapted to use the new test harness fixture instead of the old one.
    • The core test logic (API request payloads and response assertions) will be kept identical.
    • This will allow us to run the same tests against both the old and new servers, providing a direct and reliable way to verify that our refactored implementation is correct.

Post-Refactor Tasks

  • Hardening: Finalize command-line argument parsing using the existing server_args.py.
  • Deprecation: Once the new server is stable and fully validated by the adapted tests, plan the formal deprecation and removal of the OpenAI-compatible endpoints from http_server.py.
  • Support for Responses API: Implement the OpenAI Responses API for more advanced interaction patterns. Reference: https://platform.openai.com/docs/api-reference/responses
    • POST /v1/responses (create)
    • GET /v1/responses/{response_id}
    • GET /v1/responses/{response_id}/input_items
