
Conversation

LucienShui
Contributor

Motivation

#853

Modification

Call generate in /health.
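
To illustrate the idea, here is a minimal sketch (not the actual SGLang implementation): a FastAPI GET /health handler that runs a 1-token generation and returns 503 when it fails. `generate_one_token` is a hypothetical placeholder for the server's internal generate call.

```python
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()


async def generate_one_token() -> None:
    """Placeholder for the server's internal generate call with
    max_new_tokens=1; it should raise if the inference engine is unhealthy."""
    ...


@app.get("/health")
async def health() -> Response:
    # Generate a single token; any failure is reported as 503 Service Unavailable.
    try:
        await generate_one_token()
        return Response(status_code=200)
    except Exception:
        return Response(status_code=503)
```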

Checklist

  • Before submitting a PR for review, make sure it has at least passed verification in your local development environment.
  • Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

@LucienShui force-pushed the feature/health_check branch from 3de3c45 to 4ac993a on August 19, 2024 16:43
@zhyncs
Member

zhyncs commented Aug 19, 2024

It would be better to write a test case for this URI to verify whether it meets expectations when it is genuinely available and when it is not.

@LucienShui
Contributor Author

LucienShui commented Aug 19, 2024

Honestly, I tried, but I don't know how to simulate a failure that keeps the inference service down without affecting the web service, which is what's needed to verify the health check.

I tried raising exceptions manually, but the service quit immediately. Then I reverted the code to v0.2.5 and raised an Exception at the same line as in #853, but it still quit immediately.

Maybe I need some help with the unit test. 😢

@zhyncs
Member

zhyncs commented Aug 19, 2024

Maybe you can use pytest.raises?
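
For reference, the pytest.raises pattern looks roughly like this; the failing generate call below is a hypothetical stand-in for whatever the real test would exercise:

```python
import pytest


def flaky_generate():
    # Hypothetical stand-in for a generate call against a broken inference engine.
    raise RuntimeError("inference engine unavailable")


def test_generate_failure_is_surfaced():
    # pytest.raises asserts that the wrapped call raises the expected exception.
    with pytest.raises(RuntimeError):
        flaky_generate()
```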

@vhain
Contributor

vhain commented Aug 20, 2024

I needed this feature. We've been using POST /v1/completions with {"model": " ", "prompt": " ", "max_tokens": 1} as a custom health check endpoint for the autoscaler/autohealer in our infra. Once this merges, we can finally and reliably use GET /health (or /v1/health) for health checking.
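
For comparison, a small sketch of the two probes; the base URL and port are assumptions about the deployment, not values from this PR:

```python
import requests

BASE_URL = "http://localhost:30000"  # assumed server address

# The workaround: force a 1-token completion through the OpenAI-compatible API.
resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={"model": " ", "prompt": " ", "max_tokens": 1},
    timeout=30,
)
print("completions probe:", resp.status_code)

# With this PR, a plain GET against the health endpoint should be enough.
resp = requests.get(f"{BASE_URL}/health", timeout=30)
print("health probe:", resp.status_code)
```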

One small concern: under high load, /health could time out even though the service is actually running. Maybe in the future we can design more sophisticated health check logic that doesn't need to actually run inference? But I think this is a good start.

@LucienShui
Contributor Author

LucienShui commented Aug 20, 2024

One small concern: under high load, /health could time out even though the service is actually running.

@vhain Yeah, I have considered this. Maybe for the health check (livenessProbe in k8s) we could set a long timeout, and for the load-balancer readiness check (readinessProbe in k8s) we could set a short timeout.

I think both the health check and readiness check timeouts should be determined based on various factors, such as the model, devices, and context length, which makes it difficult to find a fixed value.

Could you please share some insights or experiences regarding timeout?

@vhain
Contributor

vhain commented Aug 20, 2024

@LucienShui I think I need your insight as well 😅 We could set a distinct timeout for each hardware config and model setting, as you mentioned.

In our case, we've just been constantly monitoring the latency of 1-token generation and found the p95 to be around 5s, so we set the timeout to 10s. However, it turned out that sometimes this was not enough, so we bumped it up to 30s.

If it's taking more than 30s to generate 1 token, things are bad anyway, so we have the health check trigger our autoscaler to scale out. If it times out for more than a threshold number of iterations (currently 2), we mark the instance unhealthy and replace it with a new one.
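
In Kubernetes terms, that policy might translate into probe settings roughly like the following (a hypothetical sketch; the port and all numbers are illustrative, not a recommendation from this PR):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 30000           # assumed server port
  periodSeconds: 10
  timeoutSeconds: 30      # generous: tolerate a slow 1-token generation
  failureThreshold: 2     # two consecutive failures -> restart/replace the instance
readinessProbe:
  httpGet:
    path: /health
    port: 30000
  periodSeconds: 5
  timeoutSeconds: 5       # strict: stop routing new traffic early under overload
  failureThreshold: 1
```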

I think there is no right or wrong answer for LLM backend autoscaling/autohealing at the moment, as things are still at a pretty early stage.

@vhain
Contributor

vhain commented Aug 20, 2024

@LucienShui I actually got an idea. Maybe we can add a periodic check (1-token generation, every 10s or so, configurable via server args) running inside SGLang itself, cache the result, and use it to serve /health?

This way we can make sure that calling /health multiple times does not hurt, and it also responds very quickly.
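
A rough sketch of what such a cached background probe could look like (hypothetical names; probe_once stands in for a real 1-token generate call):

```python
import asyncio
import time

_last_ok: float = 0.0  # monotonic timestamp of the last successful probe


async def probe_once() -> None:
    """Placeholder for a real 1-token generate call; raises on failure."""
    ...


async def background_health_loop(interval: float = 10.0) -> None:
    # Periodically run the probe and cache the time of the last success.
    global _last_ok
    while True:
        try:
            await probe_once()
            _last_ok = time.monotonic()
        except Exception:
            pass  # leave the cached timestamp stale
        await asyncio.sleep(interval)


def is_healthy(max_staleness: float = 30.0) -> bool:
    # /health would just read the cached result instead of running inference.
    return (time.monotonic() - _last_ok) < max_staleness
```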

@LucienShui
Contributor Author

@vhain I have considered this approach, but it has the following issues:

  1. Under high load, the health check result might only reflect a historical (stale) health status.
  2. If we let the user control the internal loop interval, it would be better to simply let them decide the health check interval directly.
  3. Additionally, its implementation and runtime complexity are higher.

@merrymercy
Contributor

Great discussions! Can we merge this first and iterate on better designs in follow-up PRs?
For this PR, please create a separate endpoint and we can merge it.

@LucienShui force-pushed the feature/health_check branch from 00f494b to f6b64d9 on August 20, 2024 16:38
@LucienShui
Contributor Author

@merrymercy Yeah, sure! I'm glad it can be merged.

I'll try to add some test cases in another PR.

@LucienShui force-pushed the feature/health_check branch from f6b64d9 to df328ce on August 20, 2024 16:43
Co-authored-by: Yineng Zhang <me@zhyncs.com>
@zhyncs changed the title from "Generate 1 token to verify the health of the inference service in /health" to "support /v1/health using a generation 1 token" on Aug 20, 2024
@zhyncs merged commit 6242c39 into sgl-project:main on Aug 20, 2024
1 of 5 checks passed
@zhyncs
Member

zhyncs commented Aug 20, 2024

@LucienShui Thanks for your contribution!
