Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled #7434

GaoYusong · 2025-06-22T08:20:47Z

Motivation

For retracted requests, the prefill length includes more than just len(req.origin_input_ids). Using len(req.origin_input_ids) in add_one_req_ignore_eos to determine the prefill length therefore underestimates the tokens needed and can cause "RuntimeError: Prefill out of memory. Try to lower your batch size." It should be replaced with req.extend_input_len, which correctly reflects the actual prefill length. The traceback from one such failure:

2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547] Scheduler hit an exception: Traceback (most recent call last):
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/scheduler.py", line 2528, in run_scheduler_process
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     scheduler.event_loop_pp()
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     return func(*args, **kwargs)
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/scheduler.py", line 735, in event_loop_pp
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     mbs[mb_id] = self.get_next_batch_to_run()
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/scheduler.py", line 1356, in get_next_batch_to_run
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     new_batch = self.get_new_batch_prefill()
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/scheduler.py", line 1512, in get_new_batch_prefill
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     new_batch.prepare_for_extend()
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/schedule_batch.py", line 1240, in prepare_for_extend
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     out_cache_loc = self.alloc_token_slots(extend_num_tokens)
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/schedule_batch.py", line 973, in alloc_token_slots
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     raise RuntimeError(error_msg)
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547] RuntimeError: Prefill out of memory. Try to lower your batch size.
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547] Try to allocate 5736 tokens.
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547] Available tokens: 5504

Modifications

-        cur_rem_tokens = self.cur_rem_tokens - len(req.origin_input_ids)
+        cur_rem_tokens = self.cur_rem_tokens - req.extend_input_len
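
To illustrate why the two length measures can diverge, here is a minimal sketch. It is not the SGLang code: ToyReq, prefix_len, remaining_before and remaining_after are hypothetical names, and the sketch assumes that a retracted request must prefill its already generated tokens again in addition to the original prompt, which is one plausible reading of "includes more than just len(req.origin_input_ids)".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyReq:
    """Hypothetical stand-in for a scheduler request; not the SGLang Req class."""

    origin_input_ids: List[int]
    output_ids: List[int] = field(default_factory=list)
    prefix_len: int = 0  # assumed number of tokens already cached

    @property
    def extend_input_len(self) -> int:
        # Assumption: tokens to prefill = prompt + kept outputs - cached prefix.
        return len(self.origin_input_ids) + len(self.output_ids) - self.prefix_len


def remaining_before(cur_rem_tokens: int, req: ToyReq) -> int:
    # Old accounting: only the original prompt is charged against the budget.
    return cur_rem_tokens - len(req.origin_input_ids)


def remaining_after(cur_rem_tokens: int, req: ToyReq) -> int:
    # New accounting: the full prefill length is charged against the budget.
    return cur_rem_tokens - req.extend_input_len


fresh = ToyReq(origin_input_ids=list(range(1000)))
retracted = ToyReq(origin_input_ids=list(range(1000)), output_ids=list(range(400)))

budget = 5000
print(remaining_before(budget, fresh), remaining_after(budget, fresh))          # 4000 4000
print(remaining_before(budget, retracted), remaining_after(budget, retracted))  # 4000 3600
```

For a fresh request both measures agree. For the toy retracted request the old accounting reports 400 more free tokens than the prefill will actually leave, which is the kind of gap that can surface later as an allocation failure.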

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

gemini-code-assist

Summary of Changes

Hello @GaoYusong, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical out-of-memory issue in the scheduler's prefill logic. By rectifying an incorrect token length calculation for specific request types, it enhances the system's stability and resource management, particularly when handling requests that require extended input processing.

Highlights

Bug Fix: Resolved a 'Prefill out of memory' RuntimeError that occurred for retracted requests when ignore_eos was enabled.
Memory Management: Corrected the calculation of required prefill tokens by switching from len(req.origin_input_ids) to req.extend_input_len.
Accuracy Improvement: Ensured that the system accurately accounts for the actual prefill length of retracted requests, preventing resource over-allocation or under-estimation.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This PR fixes an out-of-memory issue when prefilling retracted requests with ignore_eos enabled by using req.extend_input_len to calculate token requirements. A unit test is suggested to prevent regressions.

python/sglang/srt/managers/schedule_policy.py

hnyls2002 · 2025-06-22T09:19:54Z

@GaoYusong This is a bit strange: the prefill OOM should not be bound by the lines you modified, but by https://github.com/sgl-project/sglang/pull/7434/files#diff-81361c87ec93558029686878365f86460cb0457ef2b55ef0f563057069094368R428

GaoYusong · 2025-06-24T15:13:11Z

I’ll update the PR with the new base ASAP

GaoYusong · 2025-06-25T15:35:20Z

@hnyls2002 Please take a look. The early exit you added should already address the issue, but I think it's still worthwhile to make the freed-tokens check more precise.

hnyls2002 · 2025-08-01T18:06:07Z

@GaoYusong You are right, let's merge this!
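
The early exit mentioned in the discussion above can be sketched roughly as follows. This is an illustration only: AddResult and try_admit are hypothetical names, not the SGLang API, and the real check may also account for page alignment or other budgets. The two numbers are taken from the traceback in the PR description.

```python
from enum import Enum, auto


class AddResult(Enum):
    """Hypothetical result codes for this sketch."""

    CONTINUE = auto()
    NO_TOKEN = auto()


def try_admit(extend_input_len: int, available_tokens: int) -> AddResult:
    # Early exit: refuse the request when its prefill cannot fit,
    # instead of failing later during token slot allocation.
    if extend_input_len > available_tokens:
        return AddResult.NO_TOKEN
    return AddResult.CONTINUE


# Numbers from the traceback: 5736 tokens requested, 5504 available.
print(try_admit(5736, 5504))  # AddResult.NO_TOKEN
print(try_admit(5000, 5504))  # AddResult.CONTINUE
```

With a guard of this shape the request is left in the queue rather than crashing the scheduler; the change in this PR additionally makes the later budget estimate use the same prefill length.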

GaoYusong requested review from merrymercy, Ying1123, hnyls2002 and xiezhq-hermann as code owners June 22, 2025 08:20

gemini-code-assist bot reviewed Jun 22, 2025

GaoYusong changed the title from "fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled" to "Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled" Jun 22, 2025

hnyls2002 self-assigned this Jun 22, 2025

hnyls2002 added the high priority label Jun 23, 2025

hnyls2002 assigned GaoYusong Jun 23, 2025

GaoYusong and others added 2 commits June 25, 2025 22:38

Merge branch 'sgl-project:main' into fix-prefill-out-of-memory

5b987cf

follow up on pr sgl-project#7397

5abb5e5

Merge branch 'main' into fix-prefill-out-of-memory

cce271b

hnyls2002 merged commit 4bec99e into sgl-project:main Aug 2, 2025
80 of 86 checks passed

lifuhuang pushed a commit that referenced this pull request Aug 3, 2025

Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (#7434)

3684c5c

ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025

Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (#7434)

d8e70a4

ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025

Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (#7434)

7e18581

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025

Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (sgl-project#7434)

35788e7

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025

Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (sgl-project#7434)

55fc266
