[Improvements] Merge health check route #8444
Conversation
Summary of Changes
Hello @whybeyoung, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the server's health check mechanism to provide a more robust and accurate indication of its operational status. It unifies the existing /health and /health_generate endpoints under a single, intelligent logic that considers the server's internal state, particularly during startup and warmup phases. This enhancement aims to improve compatibility with external systems like Kubernetes probes, ensuring traffic is only routed to a fully initialized and capable engine.
Highlights
- Unified Health Check Endpoint: The /health and /health_generate endpoints are now merged, with both routes pointing to the same underlying logic. This simplifies health check configuration and ensures consistency.
- Server Status Tracking: A new ServerStatus enum (Starting, Up, UnHealthy, Crashed) has been introduced in sglang/srt/utils.py to track the server's operational state. The TokenizerManager now maintains this status.
- Enhanced Health Check Logic: The health check logic in http_server.py now performs a two-step verification: first, it checks the internal server_status, returning a 503 if the server is not Up. Only if the server is Up does it proceed with the token generation test, providing a more accurate reflection of the server's readiness and capability.
- Warmup Status Integration: The server's ServerStatus is now updated during the warmup process in _execute_server_warmup. Upon successful warmup, the status is set to Up; if warmup fails, it is set to UnHealthy. This ensures that the health check accurately reflects the server's readiness post-startup, which is crucial for Kubernetes probes. (A sketch of this flow follows this list.)
Code Review
This pull request merges the /health and /health_generate endpoints, introducing a two-step health check. The changes include a new ServerStatus enum and its integration into the server lifecycle. The review suggests improvements in code clarity, debuggability, and consistency.
Signed-off-by: ybyang <ybyang7@iflytek.com>
Signed-off-by: ybyang <ybyang7@iflytek.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Kan Wu <wukanustc@gmail.com>
Although PR #8115 (which introduced a refined health check mechanism) has been reverted, the core issues it aimed to address—inadequate health monitoring in production and cloud-native environments—remain unresolved.
This PR merges the routing logic of the original /health and /health_generate endpoints. To maintain compatibility, both routes are retained but follow the same logic:
Step 1: When sglang starts, the state is initialized to Starting. Once the warmup request succeeds, the state is set to Up.
Step 2: Only when the state is Up does the system attempt to generate a token (taking the current load into account) to judge whether the server is healthy.
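A minimal sketch of this two-step route, assuming FastAPI-style handlers as in http_server.py; generate_health_check_token and the request.app.state.tokenizer_manager attachment point are hypothetical stand-ins for the actual generation test and state wiring:

```python
from fastapi import FastAPI, Request, Response

from sglang.srt.utils import ServerStatus  # introduced by this PR

app = FastAPI()


@app.get("/health")
@app.get("/health_generate")  # both routes share one handler
async def health(request: Request) -> Response:
    """Merged health check: gate on server_status first, then run the generation test."""
    tokenizer_manager = request.app.state.tokenizer_manager  # assumed attachment point

    # Step 1: reject probes until warmup has marked the server Up.
    if tokenizer_manager.server_status != ServerStatus.Up:
        return Response(
            content=f"Server is {tokenizer_manager.server_status.value}",
            status_code=503,
        )

    # Step 2: only an Up server attempts the token-generation test.
    ok = await generate_health_check_token(tokenizer_manager)  # hypothetical helper
    return Response(status_code=200 if ok else 503)
```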
This PR, combined with Kubernetes probes, ensures that traffic is routed to the engine only after it has fully started, and that the health_generate logic is executed only when the engine is running normally.
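For illustration, a minimal sketch of how an external readiness check (what a Kubernetes HTTP probe effectively does) would interpret the merged endpoint; the host, port, and timeout are placeholders, not part of this PR:

```python
import urllib.error
import urllib.request


def is_ready(base_url: str = "http://127.0.0.1:30000", timeout: float = 5.0) -> bool:
    """Return True only if /health answers 200, i.e. the engine is Up and generated a token."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        # A 503 (still Starting or UnHealthy) or a connection error means "do not route traffic yet".
        return False
```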
CC @ByronHsu @merrymercy