support /v1/health using a generation 1 token #1154
Conversation
The branch was force-pushed from 3de3c45 to 4ac993a.
It would be better to write a test case for this URI to verify that it behaves as expected both when the service is genuinely available and when it is not.
Honestly, I tried, but I don't know how to simulate a failure that keeps the inference service down without affecting the web service, so that the health check can be verified. I tried raising exceptions manually, but the service quit immediately. Then I reverted the code to v0.2.5 and raised an Exception at the same line as in #853, but it still quit immediately. Maybe I need some help with the unit test. 😢
Maybe you can use …
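For reference, one possible way to simulate an unresponsive inference backend in a unit test is to patch the generation call with unittest.mock so that it hangs while the web service keeps running. This is only a sketch, not SGLang's actual internals; the route, the generate_one_token function, and the 1 s timeout are placeholders, and it may or may not be what the reviewer had in mind.

```python
import asyncio
from unittest.mock import patch

from fastapi import FastAPI, Response
from fastapi.testclient import TestClient

app = FastAPI()

async def generate_one_token() -> str:
    # Placeholder for the real 1-token generation call.
    return "ok"

@app.get("/health")
async def health() -> Response:
    try:
        await asyncio.wait_for(generate_one_token(), timeout=1.0)
        return Response(status_code=200)
    except asyncio.TimeoutError:
        return Response(status_code=503)

def test_health_reports_unhealthy_when_generation_hangs():
    async def hang() -> str:
        # Simulate an inference engine that is stuck but has not crashed,
        # so the HTTP server itself stays up.
        await asyncio.sleep(60)
        return "never"

    with patch(f"{__name__}.generate_one_token", hang):
        assert TestClient(app).get("/health").status_code == 503
```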
I needed this feature. We've been using … One little concern is that under high load, …
@vhain Yeah, I have considered this. Maybe for the health check (livenessProbe in k8s) we could set a long timeout, and for the load-balancer ready check (readinessProbe in k8s) we could set a short timeout. I think both timeouts should be determined based on factors such as the model, the devices, and the context lengths, which makes it difficult to find a fixed value. Could you please share some insights or experiences regarding timeouts?
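For what it's worth, one way to realize the "long timeout for liveness, short timeout for readiness" idea is to run the same probe with different timeouts. A rough sketch follows; the URL, port, and timeout values are only illustrative assumptions, not recommendations.

```python
# Illustrative probe script: exit code 0 = healthy, 1 = unhealthy.
# A livenessProbe could run it with a generous timeout (e.g. --timeout 30)
# and a readinessProbe with a tighter one (e.g. --timeout 5).
import argparse
import sys

import requests

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--url", default="http://127.0.0.1:30000/health")
    parser.add_argument("--timeout", type=float, default=5.0)
    args = parser.parse_args()
    try:
        resp = requests.get(args.url, timeout=args.timeout)
        return 0 if resp.status_code == 200 else 1
    except requests.RequestException:
        # Connection errors and timeouts both count as unhealthy.
        return 1

if __name__ == "__main__":
    sys.exit(main())
```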
@LucienShui I think I need your insight as well 😅 I think we can set a distinct timeout for each hardware config and model setting, as you mentioned. In our case, we've just been constantly monitoring the latency for 1-token generation and found the p95 to be around 5 s, so we set the timeout to 10 s. However, it turned out that sometimes this was not enough, so we bumped it up to 30 s. If it's taking more than 30 s to generate 1 token, it's bad anyway, so we have the health check trigger our autoscaler to scale out. If it times out for more than a threshold number of iterations (currently 2), we mark the instance unhealthy and replace it with a new one. I think there is no right or wrong answer for LLM backend autoscaling/autohealing at the moment, as things are still at an early stage.
@LucienShui I actually got an idea. Maybe we can add a periodic check (1-token generation, every 10 s or so, configurable by server args) running inside SGLang itself, cache the result, and use it for /health. This way we can make sure that calling /health stays cheap even under load.
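A rough sketch of that cached-probe idea is below. The class and function names, the 10 s interval, and the threading approach are all placeholders, not a proposal for the actual implementation.

```python
# Sketch: a background thread generates one token every `interval` seconds
# and caches the outcome; the /health handler only reads the cache, so its
# latency no longer depends on the inference queue.
import threading
import time

class CachedHealth:
    def __init__(self, generate_one_token, interval: float = 10.0):
        self._generate = generate_one_token   # expensive 1-token generation
        self._interval = interval
        self._healthy = True
        self._lock = threading.Lock()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self) -> None:
        while True:
            try:
                self._generate()
                ok = True
            except Exception:
                ok = False
            with self._lock:
                self._healthy = ok
            time.sleep(self._interval)

    def is_healthy(self) -> bool:
        # Called by the /health handler; no generation happens here.
        with self._lock:
            return self._healthy
```

A health handler would then return 200 or 503 based on is_healthy() alone.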
@vhain I have considered this approach, but it has the following issues: …
Great discussion! Can we merge this first and iterate on better designs in follow-up PRs?
The branch was force-pushed from 00f494b to f6b64d9.
@merrymercy Yeah, sure! I'm glad it can be merged. I'll try to add some test cases in another PR.
The branch was force-pushed from f6b64d9 to df328ce.
Co-authored-by: Yineng Zhang <me@zhyncs.com>
@LucienShui Thanks for your contribution!
support /v1/health using a generation 1 token (sgl-project#1154) Co-authored-by: Yineng Zhang <me@zhyncs.com>
Motivation
#853
Modification
Call generate in /health to generate 1 token, so the endpoint only succeeds when the inference engine is responsive.
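Conceptually, the change works roughly like the sketch below. This is not the actual SGLang code: async_generate, the probe prompt, and the 30 s timeout are stand-ins for whatever the server really uses.

```python
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()

async def async_generate(payload: dict) -> dict:
    # Stand-in for the server's real generation path.
    await asyncio.sleep(0)
    return {"text": "!"}

@app.get("/health")
async def health() -> Response:
    # Generating a single token exercises the whole inference pipeline,
    # so /health fails when the engine is stuck even though the HTTP
    # server itself is still up (the situation described in #853).
    payload = {"text": "ping", "sampling_params": {"max_new_tokens": 1}}
    try:
        await asyncio.wait_for(async_generate(payload), timeout=30.0)
        return Response(status_code=200)
    except Exception:
        return Response(status_code=503)
```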
Checklist
- pre-commit run --all-files or other linting tools are used to fix potential lint issues.