GaoYusong
Contributor

Motivation

For retracted requests, the prefill length covers more than len(req.origin_input_ids). add_one_req_ignore_eos currently uses len(req.origin_input_ids) as the prefill length, which underestimates the number of tokens the prefill needs and can cause RuntimeError: Prefill out of memory. Try to lower your batch size. It should use req.extend_input_len instead, which reflects the actual prefill length.

2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547] Scheduler hit an exception: Traceback (most recent call last):
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/scheduler.py", line 2528, in run_scheduler_process
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     scheduler.event_loop_pp()
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     return func(*args, **kwargs)
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/scheduler.py", line 735, in event_loop_pp
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     mbs[mb_id] = self.get_next_batch_to_run()
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/scheduler.py", line 1356, in get_next_batch_to_run
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     new_batch = self.get_new_batch_prefill()
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/scheduler.py", line 1512, in get_new_batch_prefill
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     new_batch.prepare_for_extend()
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/schedule_batch.py", line 1240, in prepare_for_extend
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     out_cache_loc = self.alloc_token_slots(extend_num_tokens)
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]   File "/home/SGLang/python/sglang/srt/managers/schedule_batch.py", line 973, in alloc_token_slots
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547]     raise RuntimeError(error_msg)
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547] RuntimeError: Prefill out of memory. Try to lower your batch size.
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547] Try to allocate 5736 tokens.
2025-06-18 13:43:34 ERROR 31423 [ TP6 PP0 scheduler.py:2547] Available tokens: 5504

Modifications

-        cur_rem_tokens = self.cur_rem_tokens - len(req.origin_input_ids)
+        cur_rem_tokens = self.cur_rem_tokens - req.extend_input_len
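
To illustrate why the two expressions can diverge, here is a minimal sketch. It is not SGLang code: ToyReq, cached_prefix_len, and both helper functions are hypothetical names invented for this example, and it assumes that a retracted request must re-prefill its prompt plus the tokens it had already generated, minus whatever prefix is still cached.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyReq:
    """Hypothetical stand-in for a scheduler request; not SGLang's Req class."""

    origin_input_ids: List[int]
    # Tokens generated before the request was retracted (empty for a fresh request).
    output_ids: List[int] = field(default_factory=list)
    # Assumed number of leading tokens still present in the KV cache.
    cached_prefix_len: int = 0

    @property
    def extend_input_len(self) -> int:
        # Assumption: prefill covers prompt + earlier outputs, minus the cached prefix.
        return len(self.origin_input_ids) + len(self.output_ids) - self.cached_prefix_len


def remaining_tokens_old(cur_rem_tokens: int, req: ToyReq) -> int:
    # Budget check before the change: only the original prompt is counted.
    return cur_rem_tokens - len(req.origin_input_ids)


def remaining_tokens_new(cur_rem_tokens: int, req: ToyReq) -> int:
    # Budget check after the change: the full prefill length is counted.
    return cur_rem_tokens - req.extend_input_len


fresh = ToyReq(origin_input_ids=list(range(100)))
retracted = ToyReq(origin_input_ids=list(range(100)), output_ids=list(range(50)))

budget = 120
print(remaining_tokens_old(budget, fresh), remaining_tokens_new(budget, fresh))
print(remaining_tokens_old(budget, retracted), remaining_tokens_new(budget, retracted))
```

In this toy setup both checks agree for the fresh request, while for the retracted one the old check still reports spare tokens and the new check goes negative, which is the kind of mismatch that would let an oversized prefill through.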

…nore_eos is enabled

For retracted requests, the prefill length includes more than just len(req.origin_input_ids).
Using len(req.origin_input_ids) in add_one_req_ignore_eos to determine the prefill length leads to incorrect calculation, and can cause
'Prefill out of memory' runtime error.
It should be replaced with req.extend_input_len, which correctly reflects the actual prefill length.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @GaoYusong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical out-of-memory issue in the scheduler's prefill logic. By rectifying an incorrect token length calculation for specific request types, it enhances the system's stability and resource management, particularly when handling requests that require extended input processing.

Highlights

  • Bug Fix: Resolved a 'Prefill out of memory' RuntimeError that occurred for retracted requests when ignore_eos was enabled.
  • Memory Management: Corrected the calculation of required prefill tokens by switching from len(req.origin_input_ids) to req.extend_input_len.
  • Accuracy Improvement: Ensured that the system accurately accounts for the actual prefill length of retracted requests, preventing resource over-allocation or under-estimation.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR fixes an out-of-memory issue when prefilling retracted requests with ignore_eos enabled by using req.extend_input_len to calculate token requirements. A unit test is suggested to prevent regressions.

@GaoYusong GaoYusong changed the title fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled Jun 22, 2025
@hnyls2002
Collaborator

@GaoYusong It's a bit strange: the prefill OOM should not be governed by the lines you modified, but by https://github.com/sgl-project/sglang/pull/7434/files#diff-81361c87ec93558029686878365f86460cb0457ef2b55ef0f563057069094368R428

@GaoYusong
Contributor Author

I'll update the PR with the new base ASAP.

@GaoYusong
Contributor Author

@hnyls2002 Please take a look. The early exit you added should already address the issue, but I think it's still worthwhile to make the freed-tokens check more precise.

@hnyls2002
Collaborator

@GaoYusong You are right, let's merge this!

@hnyls2002 hnyls2002 merged commit 4bec99e into sgl-project:main Aug 2, 2025
80 of 86 checks passed
lifuhuang pushed a commit that referenced this pull request Aug 3, 2025
ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025
ShangmingCai pushed a commit that referenced this pull request Aug 5, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 18, 2025