support /v1/health using a generation 1 token #1154
Conversation
The branch was force-pushed from 3de3c45 to 4ac993a.
It would be better to write a test case for this URI to verify that it behaves as expected both when the service is genuinely available and when it is not.
Honestly, I tried, but I don't know how to simulate a failure that keeps the inference service down without affecting the web service, so that the health check can be verified. I tried raising exceptions manually, but the service quit immediately. Then I reverted the code to v0.2.5 and raised an Exception at the same line as in #853, but it still quit immediately. Maybe I need some help with the unit test. 😢
Maybe you can use …
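For reference, one possible way to simulate an unresponsive inference backend in a unit test is to patch the generation call with unittest.mock so that it hangs while the web service keeps running. This is only a sketch, not SGLang's actual internals; the route, the generate_one_token function, and the 1 s timeout are placeholders, and it may or may not be what the reviewer had in mind.

```python
import asyncio
from unittest.mock import patch

from fastapi import FastAPI, Response
from fastapi.testclient import TestClient

app = FastAPI()

async def generate_one_token() -> str:
    # Placeholder for the real 1-token generation call.
    return "ok"

@app.get("/health")
async def health() -> Response:
    try:
        await asyncio.wait_for(generate_one_token(), timeout=1.0)
        return Response(status_code=200)
    except asyncio.TimeoutError:
        return Response(status_code=503)

def test_health_reports_unhealthy_when_generation_hangs():
    async def hang() -> str:
        # Simulate an inference engine that is stuck but has not crashed,
        # so the HTTP server itself stays up.
        await asyncio.sleep(60)
        return "never"

    with patch(f"{__name__}.generate_one_token", hang):
        assert TestClient(app).get("/health").status_code == 503
```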
I needed this feature. We've been using … One little concern is that under high load, …
@vhain Yeah, I have considered this. Maybe for the health check (livenessProbe in k8s) we could set a long timeout, and for the load-balancer ready check (readinessProbe in k8s) we could set a short timeout. I think both timeouts should be determined based on factors such as the model, the devices, and the context lengths, which makes it difficult to find a fixed value. Could you please share some insights or experiences regarding timeouts?
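For what it's worth, one way to realize the "long timeout for liveness, short timeout for readiness" idea is to run the same probe with different timeouts. A rough sketch follows; the URL, port, and timeout values are only illustrative assumptions, not recommendations.

```python
# Illustrative probe script: exit code 0 = healthy, 1 = unhealthy.
# A livenessProbe could run it with a generous timeout (e.g. --timeout 30)
# and a readinessProbe with a tighter one (e.g. --timeout 5).
import argparse
import sys

import requests

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--url", default="http://127.0.0.1:30000/health")
    parser.add_argument("--timeout", type=float, default=5.0)
    args = parser.parse_args()
    try:
        resp = requests.get(args.url, timeout=args.timeout)
        return 0 if resp.status_code == 200 else 1
    except requests.RequestException:
        # Connection errors and timeouts both count as unhealthy.
        return 1

if __name__ == "__main__":
    sys.exit(main())
```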
@LucienShui I think I need your insight as well 😅 I think we can set a distinct timeout for each hardware config and model setting, as you mentioned. In our case, we've just been constantly monitoring the latency for 1-token generation and found the p95 to be around 5 s, so we set the timeout to 10 s. However, it turned out that sometimes this was not enough, so we bumped it up to 30 s. If it's taking more than 30 s to generate 1 token, it's bad anyway, so we have the health check trigger our autoscaler to scale out. If it times out for more than a threshold number of iterations (currently 2), we mark the instance unhealthy and replace it with a new one. I think there is no right or wrong answer for LLM backend autoscaling/autohealing at the moment, as things are still at an early stage.
@LucienShui I actually got an idea. Maybe we can add a periodic check (1-token generation, every 10 s or so, configurable by server args) running inside SGLang itself, cache the result, and use it for /health. This way we can make sure that calling /health stays cheap even under load.
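A rough sketch of that cached-probe idea is below. The class and function names, the 10 s interval, and the threading approach are all placeholders, not a proposal for the actual implementation.

```python
# Sketch: a background thread generates one token every `interval` seconds
# and caches the outcome; the /health handler only reads the cache, so its
# latency no longer depends on the inference queue.
import threading
import time

class CachedHealth:
    def __init__(self, generate_one_token, interval: float = 10.0):
        self._generate = generate_one_token   # expensive 1-token generation
        self._interval = interval
        self._healthy = True
        self._lock = threading.Lock()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self) -> None:
        while True:
            try:
                self._generate()
                ok = True
            except Exception:
                ok = False
            with self._lock:
                self._healthy = ok
            time.sleep(self._interval)

    def is_healthy(self) -> bool:
        # Called by the /health handler; no generation happens here.
        with self._lock:
            return self._healthy
```

A health handler would then return 200 or 503 based on is_healthy() alone.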
@vhain I have considered this approach, but it has the following issues: …
Great discussion! Can we merge this first and iterate on better designs in follow-up PRs?
The branch was force-pushed from 00f494b to f6b64d9.
@merrymercy Yeah, sure! I'm glad it can be merged. I'll try to add some test cases in another PR.
The branch was force-pushed from f6b64d9 to df328ce.
Co-authored-by: Yineng Zhang <me@zhyncs.com>
@LucienShui Thanks for your contribution!
support /v1/health using a generation 1 token (sgl-project#1154) Co-authored-by: Yineng Zhang <me@zhyncs.com>
Motivation
#853
Modification
Call generate in /health to generate 1 token, so the endpoint only succeeds when the inference engine is responsive.
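Conceptually, the change works roughly like the sketch below. This is not the actual SGLang code: async_generate, the probe prompt, and the 30 s timeout are stand-ins for whatever the server really uses.

```python
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()

async def async_generate(payload: dict) -> dict:
    # Stand-in for the server's real generation path.
    await asyncio.sleep(0)
    return {"text": "!"}

@app.get("/health")
async def health() -> Response:
    # Generating a single token exercises the whole inference pipeline,
    # so /health fails when the engine is stuck even though the HTTP
    # server itself is still up (the situation described in #853).
    payload = {"text": "ping", "sampling_params": {"max_new_tokens": 1}}
    try:
        await asyncio.wait_for(async_generate(payload), timeout=30.0)
        return Response(status_code=200)
    except Exception:
        return Response(status_code=503)
```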
Checklist
- pre-commit run --all-files or other linting tools are used to fix potential lint issues.