whybeyoung
Collaborator

@whybeyoung whybeyoung commented Jul 17, 2025

Background

The current health endpoint in sglang is effectively a placeholder: it always returns a 200 status code, whatever state the server is in. This is a stability risk in production, especially when sglang runs under cloud-native orchestrators such as Kubernetes. Without a meaningful health status, the orchestrator cannot detect service anomalies or coordinate lifecycle actions such as restarts.

Proposed Solution

We introduce a server status enum to address this gap:

from enum import Enum


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"

    def is_healthy(self) -> bool:
        # Only the Up state counts as healthy.
        return self == ServerStatus.Up

A service is considered healthy only when its status is Up. All other states indicate an unhealthy condition.
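As a minimal sketch of how this rule could drive an HTTP response, the helper below maps a status to a response code and body. The name `health_response` and the choice of 503 for every non-Up state are assumptions for illustration; the description above does not specify which code the endpoint returns.

```python
from enum import Enum
from http import HTTPStatus
from typing import Dict, Tuple


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"

    def is_healthy(self) -> bool:
        return self == ServerStatus.Up


def health_response(status: ServerStatus) -> Tuple[int, Dict[str, str]]:
    """Hypothetical helper: map a status to (HTTP code, JSON-able body)."""
    # Assumption: every non-Up state is reported as 503 Service Unavailable.
    code = HTTPStatus.OK if status.is_healthy() else HTTPStatus.SERVICE_UNAVAILABLE
    return int(code), {"status": status.value}
```

Returning the status name in the body lets an operator tell "Starting" apart from "Crashed" even though both map to the same failure code for a probe.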

State Transition Logic

  1. Standalone PD (Prefill-Decode) Mode
     • Initial state: Starting (engine initialization phase)
     • Transitions to Up after:
         • HTTP server completes startup
         • Warm-up requests execute successfully
     • Transitions to Crashed if:
         • Scheduler or other critical subprocesses exit abnormally
         • Fatal errors occur during request processing
  2. PD-Separated Mode
     This mechanism becomes even more critical in distributed deployments:
     • Enables detection of hang/failure states across multiple Prefill (P) and Decode (D) nodes
     • Allows Kubernetes/LWS to automatically restart unhealthy replicas by monitoring the health endpoint
     • Prevents traffic from being routed to nodes in Starting/UnHealthy/Crashed states
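The standalone-mode transitions above can be sketched as a small state holder. `StatusTracker`, `mark_ready`, and `mark_crashed` are hypothetical names, not part of sglang. The sketch also assumes that both startup conditions must hold before the server becomes Up, and that Crashed is terminal.

```python
import threading
from enum import Enum


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"


class StatusTracker:
    """Hypothetical thread-safe holder for the current server status."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status = ServerStatus.Starting  # engine initialization phase

    @property
    def status(self) -> ServerStatus:
        with self._lock:
            return self._status

    def mark_ready(self, http_started: bool, warmup_ok: bool) -> None:
        # Assumption: Up requires both conditions, and only from Starting.
        with self._lock:
            if (
                self._status is ServerStatus.Starting
                and http_started
                and warmup_ok
            ):
                self._status = ServerStatus.Up

    def mark_crashed(self) -> None:
        # Called on abnormal subprocess exit or a fatal request error.
        with self._lock:
            self._status = ServerStatus.Crashed
```

Because `mark_ready` only acts from Starting, a late warm-up success cannot move a crashed server back to Up.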

Rationale for Not Using health_generate

The existing health_generate endpoint was deemed unsuitable because:
  • Its response time is heavily influenced by batch queues in high-concurrency scenarios
  • It introduces unnecessary computation overhead for health checks
  • It fails to provide granular status information (e.g., distinguishing between "starting" and "crashed")

Limitations and Future Work

A known limitation is that engine-level hangs may not be detected by this mechanism. We plan to supplement this with:
  • Periodic internal liveness probes
  • Heartbeat monitoring between critical components
  • Automatic state transition to UnHealthy on probe timeouts

This change is intended to make sglang more reliable in cloud-native production environments while keeping the health check implementation simple.
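The planned heartbeat-and-timeout behavior is not part of this PR. As a rough sketch of the idea only, a monitor could record the time of the last heartbeat and switch to UnHealthy once a timeout elapses. `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are all hypothetical.

```python
import time
from enum import Enum
from typing import Callable


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"


class HeartbeatMonitor:
    """Hypothetical watchdog: flags UnHealthy when heartbeats stop."""

    def __init__(
        self, timeout_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_beat = clock()
        self.status = ServerStatus.Up

    def beat(self) -> None:
        # Called by the monitored component each time it makes progress.
        self._last_beat = self._clock()

    def check(self) -> ServerStatus:
        # Called periodically by the probe; demotes Up to UnHealthy on timeout.
        elapsed = self._clock() - self._last_beat
        if self.status is ServerStatus.Up and elapsed > self._timeout_s:
            self.status = ServerStatus.UnHealthy
        return self.status
```

This sketch never recovers from UnHealthy on its own; whether a later heartbeat should restore Up is a design choice the description leaves open.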

CC @ShangmingCai @ByronHsu @hnyls2002 @zhyncs

Signed-off-by: ybyang <ybyang7@iflytek.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @whybeyoung, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the health check mechanism from a static 200-status response to a dynamic, multi-state system. It introduces a ServerStatus enum to represent granular states like Up, Starting, UnHealthy, and Crashed, and integrates status updates throughout the server's lifecycle. This enhancement provides critical visibility into the server's operational state, enabling more reliable deployment and management in cloud-native environments.

Highlights

  • Enhanced Health Check Logic: Replaced the simplistic 200-status health check with a detailed ServerStatus enum (Up, Starting, UnHealthy, Crashed) to provide granular server state, crucial for cloud-native environments.
  • Dynamic Status Reporting: Introduced a new /health POST endpoint and a report_health utility function, enabling various components (engine, scheduler, HTTP server) to update the server's health status dynamically throughout its lifecycle.
  • Improved Lifecycle Management: Integrated status updates into critical server phases like startup, warmup, and error handling (e.g., child process crashes, scheduler exceptions), allowing for more accurate detection of service anomalies and better orchestration.
  • Refactored Health State Management: Replaced the boolean health_check_failed flag with the comprehensive ServerStatus enum, centralizing and standardizing health state management across the system for clarity and robustness.
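To illustrate the dynamic reporting idea in the highlights, here is a framework-free sketch of a status update handler. It is not the PR's implementation. The function name `handle_post_health`, the `{"status": ...}` payload shape, the shared `state` dict, and the 400 response for unknown values are all assumptions.

```python
from enum import Enum
from typing import Any, Dict, Tuple


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"


def handle_post_health(
    state: Dict[str, ServerStatus], payload: Dict[str, Any]
) -> Tuple[int, Dict[str, str]]:
    """Hypothetical handler: validate a reported status, then store it."""
    raw = payload.get("status")
    try:
        new_status = ServerStatus(raw)  # raises ValueError for unknown values
    except ValueError:
        # Reject bad input instead of letting the handler fail.
        return 400, {"error": f"unknown status: {raw!r}"}
    state["status"] = new_status
    return 200, {"status": new_status.value}
```

Validating the reported value before storing it keeps a malformed report from corrupting the shared health state.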

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a more robust health check mechanism, which is a great improvement for production stability. The state transitions and overall logic are well-thought-out.

I've identified a few critical issues where the code would fail at runtime due to incorrect attribute access on a class instead of an instance. I've also suggested improvements for robustness in the report_health utility and error handling in the new /health POST endpoint.
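The review does not quote the affected code, so the following only illustrates the general failure class it describes: an attribute that is set in `__init__` exists on instances, not on the class itself. `Manager` and `server_status` are hypothetical names.

```python
from enum import Enum


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"


class Manager:
    """Hypothetical component that stores its status per instance."""

    def __init__(self) -> None:
        self.server_status = ServerStatus.Starting


def read_status_from_class() -> str:
    # Bug pattern: the attribute only exists after __init__ runs on an
    # instance, so looking it up on the class raises AttributeError.
    try:
        return Manager.server_status.value
    except AttributeError as exc:
        return f"AttributeError: {exc}"


def read_status_from_instance(manager: Manager) -> str:
    # Correct pattern: read the attribute from a constructed instance.
    return manager.server_status.value
```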

Once these issues are addressed, this will be a solid feature.

Signed-off-by: ybyang <ybyang7@iflytek.com>
@whybeyoung whybeyoung changed the title from "[Feature] Simple enhance for health check" to "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" on Jul 17, 2025
@whybeyoung whybeyoung requested a review from CatherineSue as a code owner July 20, 2025 01:06
@zhyncs zhyncs merged commit 4540a46 into sgl-project:main Jul 20, 2025
1 of 53 checks passed
ch-wan pushed a commit that referenced this pull request Jul 23, 2025
…Stability (#8115)

Signed-off-by: ybyang <ybyang7@iflytek.com>