Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled #7434
…nore_eos is enabled
For retracted requests, the prefill length includes more than just len(req.origin_input_ids). Using len(req.origin_input_ids) in add_one_req_ignore_eos to determine the prefill length leads to an incorrect calculation and can cause a 'Prefill out of memory' runtime error. It should be replaced with req.extend_input_len, which correctly reflects the actual prefill length.
Summary of Changes
Hello @GaoYusong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical out-of-memory issue in the scheduler's prefill logic. By rectifying an incorrect token length calculation for specific request types, it enhances the system's stability and resource management, particularly when handling requests that require extended input processing.
Highlights
- Bug Fix: Resolved a 'Prefill out of memory' RuntimeError that occurred for retracted requests when ignore_eos was enabled.
- Memory Management: Corrected the calculation of required prefill tokens by switching from len(req.origin_input_ids) to req.extend_input_len.
- Accuracy Improvement: Ensured that the system accurately accounts for the actual prefill length of retracted requests, preventing resource over-allocation or under-estimation.
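The length discrepancy can be illustrated with a minimal sketch. This is not SGLang's scheduler code: ToyReq, prefix_len, and the two helper functions are hypothetical stand-ins. The sketch assumes that a retracted request has to re-prefill its original prompt plus the tokens it generated before retraction, minus whatever a cached prefix already covers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyReq:
    """Hypothetical stand-in for a scheduler request (not SGLang's Req class)."""

    origin_input_ids: List[int]
    output_ids: List[int] = field(default_factory=list)
    prefix_len: int = 0  # assumed: tokens already covered by a cached prefix

    @property
    def extend_input_len(self) -> int:
        # Assumption: tokens to prefill = prompt + generated tokens - cached prefix.
        return len(self.origin_input_ids) + len(self.output_ids) - self.prefix_len


def prefill_tokens_old(req: ToyReq) -> int:
    # Estimate before the fix: only the original prompt is counted.
    return len(req.origin_input_ids)


def prefill_tokens_new(req: ToyReq) -> int:
    # Estimate after the fix: use the actual extend length.
    return req.extend_input_len


fresh = ToyReq(origin_input_ids=[1, 2, 3, 4])
retracted = ToyReq(origin_input_ids=[1, 2, 3, 4], output_ids=[9, 9, 9])

print(prefill_tokens_old(fresh), prefill_tokens_new(fresh))          # 4 4
print(prefill_tokens_old(retracted), prefill_tokens_new(retracted))  # 4 7
```

For a fresh request both estimates agree. For the retracted one, the old estimate misses the three already-generated tokens, which is the kind of undercount that can let the scheduler admit a prefill it has no room for.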
Code Review
This PR fixes an out-of-memory issue when prefilling retracted requests with ignore_eos enabled by using req.extend_input_len to calculate token requirements. A unit test is suggested to prevent regressions.
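One possible shape for such a regression test is sketched below. It does not exercise SGLang itself: can_admit is a hypothetical admission check, and the token counts are made up for illustration. It only pins down the property under discussion, namely that a retracted request's generated tokens must count toward its prefill length.

```python
import unittest


def can_admit(prefill_len: int, free_tokens: int) -> bool:
    """Hypothetical admission check: admit only if the prefill fits in free tokens."""
    return prefill_len <= free_tokens


class RetractedPrefillTest(unittest.TestCase):
    def test_retracted_request_counts_generated_tokens(self):
        origin_input_ids = list(range(8))  # original prompt
        output_ids = list(range(5))        # tokens generated before retraction
        free_tokens = 10

        stale_len = len(origin_input_ids)                    # old estimate
        actual_len = len(origin_input_ids) + len(output_ids) # assumed real prefill length

        # The old estimate would admit a request that does not actually fit.
        self.assertTrue(can_admit(stale_len, free_tokens))
        # The corrected length is rejected, so no over-commit happens.
        self.assertFalse(can_admit(actual_len, free_tokens))
```

Saved to a file, this can be run with python -m unittest. A real test would construct an actual retracted request and call the scheduler's admission path instead of the toy check.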
@GaoYusong It's kind of strange: the prefill OOM should not be bound by the lines you modified, but by https://github.com/sgl-project/sglang/pull/7434/files#diff-81361c87ec93558029686878365f86460cb0457ef2b55ef0f563057069094368R428
I'll update the PR with the new base ASAP.
@hnyls2002 Please take a look. The early exit you added should already address the issue, but I think it's still worthwhile to make the freed tokens check more precise.
@GaoYusong You are right, let's merge this!
…nore_eos is enabled (#7434)
Motivation
For retracted requests, the prefill length includes more than just len(req.origin_input_ids). Using len(req.origin_input_ids) in add_one_req_ignore_eos to determine the prefill length leads to incorrect calculation, and can cause RuntimeError: Prefill out of memory. Try to lower your batch size. It should be replaced with req.extend_input_len, which correctly reflects the actual prefill length.
Modifications
Checklist