-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Open AI API hidden states #6716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…statement cleanup
…ng the preference of the spec decoder in multiple places. we need to move this to a completely engine-level parameter to fix.
… prior to testing on H100
Done! |
@kyle-pena-kuzco Nice. @CatherineSue will help review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall changes in openai_api and tests LGTM. Left some comments and questions for better clarification.
May need another pass on the speculative side changes.
Thank you @CatherineSue - I am addressing your feedback now! |
Agree with this simplification. Co-authored-by: Chang Su <csu272@usc.edu>
… into openai-hidden-states
@CatherineSue - I think I have addressed your feedback with my latest commits. Please let me know if there is anything else I can do or change. I really appreciate your consideration. |
Hi @CatherineSue - let me know if you have any questions! I think i've addressed all of the feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kyle-pena-kuzco Thanks for resolving all the comments. LGTM. If all CI passes I think we are good to go
@CatherineSue - it looks like there are merge conflicts now. i'll fix these today. |
hi @CatherineSue - let me know what you think of the run results. It's difficult for me to tell if these are genuine regressions because I see other commits / PRs with similar failures. |
Hi @CatherineSue - is there anything else we should do in order for this PR to be merged? I see that gemini might be blocking the merge. |
Hi @kyle-pena-kuzco, thanks for the update. Regarding your question about My role is to provide feedback and a review status based on the code changes when a review is requested or triggered. If you would like me to perform a review of the latest changes to provide feedback, please let me know. Otherwise, you might want to check the status of other required checks or reviews listed on the pull request page to see what might be pending. |
cc @zhyncs The CI looks good. Should we merge? |
@CatherineSue - as an FYI, I have the merge conflict resolved locally. i'm waiting to push it up until @zhyncs has had a chance to review, so that the CI doesn't re-run. |
@BBuf I've resolved another conflict and pushed. Let me know if any CI failures are something that I should look into. It's difficult to tell which CI failures are genuine. |
@kyle-pena-kuzco Good job!!! But why it would not return the hidden_state of last token? "return_hidden_states" for me means return all hidden_states. |
If you are interacting the API, it's not practical to return all hidden states because this would involve serializing something like 30MB of raw float data for a typical model / response length. You can still get all tokens if you interact with the engine directly (instead of through the API). |
Motivation
This PR exposes hidden states through the OpenAI API.
Hidden states can now be returned by including
return_hidden_states
in the request body.This feature is enabled by supplying the flag
--enable-return-hidden-states
.The test failure noted in the last submission of this PR is resolved, as explained below.
This feature supports all varieties of OpenAI generation requests:
The previous implementation would re-compile the CUDA graph every time the user switched between
return_hidden_states
True and False at the request level.This behavior is resolved - changing
return_hidden_states
will no longer trigger a CUDA graph re-capture.Modifications
CaptureHiddenMode.FULL
on startup (so that a double-capture isn't needed).TARGET_VERIFY
runs withCaptureHiddenMode.FULL
, andTARGET_EXTEND
does not use the CUDA graph runner.Checklist