[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability #8115

whybeyoung · 2025-07-17T08:23:50Z

Background

The current health interface in sglang is a "fake" endpoint that simply returns a 200 status code. This poses significant stability issues in production environments, especially when integrated with cloud-native systems like Kubernetes. The lack of meaningful health status makes it impossible to accurately detect service anomalies or coordinate lifecycle management (e.g., restarts) in orchestrated environments.

Proposed Solution

We've designed a robust server status mechanism to address this gap:

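from enum import Enum


class ServerStatus(Enum):
    """Lifecycle state reported by the health endpoint."""

    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"

    def is_healthy(self) -> bool:
        # Only Up counts as healthy; every other state is unhealthy.
        return self == ServerStatus.Up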

A service is considered healthy only when its status is Up. All other states indicate an unhealthy condition.
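To make the rule above concrete, here is a minimal sketch of how a health handler could turn a status into an HTTP response. The helper name `health_response`, the JSON body shape, and the choice of 503 for every non-Up state are assumptions for illustration, not something this PR specifies.

```python
from enum import Enum
from typing import Dict, Tuple


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"

    def is_healthy(self) -> bool:
        return self == ServerStatus.Up


def health_response(status: ServerStatus) -> Tuple[int, Dict[str, str]]:
    """Hypothetical helper: map a status to an (HTTP code, JSON body) pair.

    Assumption: 200 for Up and 503 for every other state, so that an
    orchestrator probe fails whenever the server is not fully ready.
    """
    code = 200 if status.is_healthy() else 503
    return code, {"status": status.value}
```

Returning the status name in the body lets an operator tell "Starting" apart from "Crashed" even though a probe only looks at the status code.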

State Transition Logic

Standalone PD (Prefill-Decode) Mode
- Initial state: Starting (engine initialization phase)
- Transitions to Up after:
  - HTTP server completes startup
  - Warm-up requests execute successfully
- Transitions to Crashed if:
  - Scheduler or other critical subprocesses exit abnormally
  - Fatal errors occur during request processing

PD-Separated Mode
This mechanism becomes even more critical in distributed deployments:
- Enables detection of hang/failure states across multiple Prefill (P) and Decode (D) nodes
- Allows Kubernetes/LWS to automatically restart unhealthy replicas by monitoring the health endpoint
- Prevents traffic from being routed to nodes in Starting/UnHealthy/Crashed states
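The standalone-mode transitions can be sketched as a small thread-safe tracker. The class `StatusTracker` and its method names are hypothetical, and treating Crashed as a terminal state is an assumption of this sketch rather than a documented behavior of the PR.

```python
import threading
from enum import Enum


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"


class StatusTracker:
    """Hypothetical holder for the server status (not the PR's actual code)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status = ServerStatus.Starting  # engine initialization phase
        self._http_ready = False
        self._warmup_ok = False

    @property
    def status(self) -> ServerStatus:
        with self._lock:
            return self._status

    def mark_http_ready(self) -> None:
        with self._lock:
            self._http_ready = True
            self._maybe_up()

    def mark_warmup_ok(self) -> None:
        with self._lock:
            self._warmup_ok = True
            self._maybe_up()

    def mark_crashed(self) -> None:
        # Called when a critical subprocess exits or a fatal error occurs.
        with self._lock:
            self._status = ServerStatus.Crashed

    def _maybe_up(self) -> None:
        # Caller holds the lock. Assumption: Crashed is terminal, so only
        # a server that is still Starting can move to Up.
        if (
            self._status is ServerStatus.Starting
            and self._http_ready
            and self._warmup_ok
        ):
            self._status = ServerStatus.Up
```

Requiring both conditions before reporting Up keeps traffic away from a node whose HTTP server is listening but whose warm-up has not yet succeeded.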

Rationale for Not Using health_generate

The existing health_generate endpoint was deemed unsuitable because:
- Its response time is heavily influenced by batch queues in high-concurrency scenarios
- It introduces unnecessary computation overhead for health checks
- It fails to provide granular status information (e.g., distinguishing between "starting" and "crashed")

Limitations and Future Work

A known limitation is that engine-level hangs may not be detected by this mechanism. We plan to supplement this with:
- Periodic internal liveness probes
- Heartbeat monitoring between critical components
- Automatic state transition to UnHealthy on probe timeouts

This change ensures sglang works reliably in cloud-native production environments while maintaining simplicity in the health check implementation.
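The planned heartbeat monitoring with a timeout-driven transition to UnHealthy could look roughly like the sketch below. Everything here is an assumption for illustration: the `HeartbeatMonitor` class, the injectable clock, and the choice to let a fresh heartbeat recover UnHealthy back to Up are not part of this PR.

```python
import time
from enum import Enum
from typing import Callable


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"


class HeartbeatMonitor:
    """Hypothetical watchdog: flags the server when heartbeats stop arriving."""

    def __init__(
        self, timeout_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        # Called by the monitored component (e.g. a scheduler loop).
        self._last_beat = self._clock()

    def evaluate(self, current: ServerStatus) -> ServerStatus:
        """Return the status after applying the timeout rule.

        Assumptions: only Up is downgraded on a stale heartbeat, UnHealthy
        recovers once heartbeats resume, and Starting/Crashed are untouched.
        """
        stale = self._clock() - self._last_beat > self._timeout_s
        if current is ServerStatus.Up and stale:
            return ServerStatus.UnHealthy
        if current is ServerStatus.UnHealthy and not stale:
            return ServerStatus.Up
        return current
```

Using a monotonic clock avoids false timeouts when the wall clock is adjusted, and injecting the clock keeps the timeout logic testable without sleeping.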

CC @ShangmingCai @ByronHsu @hnyls2002 @zhyncs

Signed-off-by: ybyang <ybyang7@iflytek.com>

gemini-code-assist

Summary of Changes

Hello @whybeyoung, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the health check mechanism from a static 200-status response to a dynamic, multi-state system. It introduces a ServerStatus enum to represent granular states like Up, Starting, UnHealthy, and Crashed, and integrates status updates throughout the server's lifecycle. This enhancement provides critical visibility into the server's operational state, enabling more reliable deployment and management in cloud-native environments.

Highlights

- Enhanced Health Check Logic: Replaced the simplistic 200-status health check with a detailed ServerStatus enum (Up, Starting, UnHealthy, Crashed) to provide granular server state, crucial for cloud-native environments.
- Dynamic Status Reporting: Introduced a new /health POST endpoint and a report_health utility function, enabling various components (engine, scheduler, HTTP server) to update the server's health status dynamically throughout its lifecycle.
- Improved Lifecycle Management: Integrated status updates into critical server phases like startup, warmup, and error handling (e.g., child process crashes, scheduler exceptions), allowing for more accurate detection of service anomalies and better orchestration.
- Refactored Health State Management: Replaced the boolean health_check_failed flag with the comprehensive ServerStatus enum, centralizing and standardizing health state management across the system for clarity and robustness.
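As a rough illustration of the dynamic reporting idea, the sketch below validates a status report before applying it. The function `apply_status_report`, the `{"status": ...}` payload shape, and the 400 response for invalid input are all assumptions; the actual signature of report_health and the request format of the /health POST endpoint are not shown in this thread.

```python
from enum import Enum
from typing import Any, Mapping, Tuple


class ServerStatus(Enum):
    Up = "Up"
    Starting = "Starting"
    UnHealthy = "UnHealthy"
    Crashed = "Crashed"


def apply_status_report(
    payload: Mapping[str, Any], current: ServerStatus
) -> Tuple[int, ServerStatus]:
    """Hypothetical handler core: return (HTTP code, resulting status).

    Assumption: the reporter sends {"status": "<name>"}. Unknown or missing
    values are rejected with 400 and leave the current status unchanged.
    """
    raw = payload.get("status")
    try:
        new_status = ServerStatus(raw)  # raises ValueError for unknown values
    except ValueError:
        return 400, current
    return 200, new_status
```

Rejecting malformed reports instead of raising keeps a buggy reporter from crashing the endpoint that orchestrators depend on.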

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a more robust health check mechanism, which is a great improvement for production stability. The state transitions and overall logic are well-thought-out.

I've identified a few critical issues where the code would fail at runtime due to incorrect attribute access on a class instead of an instance. I've also suggested improvements for robustness in the report_health utility and error handling in the new /health POST endpoint.

Once these issues are addressed, this will be a solid feature.
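The offending lines are not shown in this thread, so the following is only a generic illustration of the bug class described above (reading an instance attribute from the class itself). The `Engine` class and its `status` attribute are hypothetical names.

```python
class Engine:
    """Hypothetical class whose attribute only exists on instances."""

    def __init__(self) -> None:
        # Set per instance in __init__, so the class object has no `status`.
        self.status = "Starting"


def read_status_from_class() -> str:
    """Show what happens when the attribute is read from the class."""
    try:
        return Engine.status  # bug: Engine is the class, not an instance
    except AttributeError as exc:
        return f"AttributeError: {exc}"


def read_status_from_instance(engine: Engine) -> str:
    """The working form: read the attribute from an instance."""
    return engine.status
```

The first function fails at runtime rather than at import time, which is why this kind of mistake tends to surface only when the affected code path is exercised.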

python/sglang/srt/entrypoints/engine.py

python/sglang/srt/managers/scheduler.py

python/sglang/srt/entrypoints/http_server.py

python/sglang/srt/utils.py

[Feature] Simple enhance for health check

a5ef25f

Signed-off-by: ybyang <ybyang7@iflytek.com>

whybeyoung requested review from merrymercy, Ying1123, hnyls2002, xiezhq-hermann, zhaochenyang20, zhyncs, ispobock and ByronHsu as code owners July 17, 2025 08:23

gemini-code-assist bot reviewed Jul 17, 2025

whybeyoung added 2 commits July 17, 2025 16:26

[Feature] Simple enhance for health check

0dbdcf6

Signed-off-by: ybyang <ybyang7@iflytek.com>

[Feature] Simple enhance for health check

1ba6bd8

Signed-off-by: ybyang <ybyang7@iflytek.com>

whybeyoung changed the title ~~[Feature] Simple enhance for health check~~ [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability Jul 17, 2025

Merge branch 'main' into feature/health_status

e3c9628

slin1237 approved these changes Jul 17, 2025

zhyncs added 2 commits July 18, 2025 22:02

Merge branch 'main' into feature/health_status

7c5138b

upd

5ec603a

zhyncs approved these changes Jul 19, 2025

zhyncs added the high priority label Jul 19, 2025

Merge branch 'main' into feature/health_status

49c4df4

whybeyoung requested a review from CatherineSue as a code owner July 20, 2025 01:06

zhyncs merged commit 4540a46 into sgl-project:main Jul 20, 2025
1 of 53 checks passed

merrymercy mentioned this pull request Jul 20, 2025

Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" #8181

Merged

ch-wan pushed a commit that referenced this pull request Jul 23, 2025

[Feature] Simple Improve Health Check Mechanism for Production-Grade …

beddf1e

…Stability (#8115) Signed-off-by: ybyang <ybyang7@iflytek.com>

whybeyoung mentioned this pull request Jul 28, 2025

[Improvements] Merge health check route #8444

Merged
