Closed
Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, which reduces the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/sglang/discussions/new/choose instead. Otherwise, it will be closed.
- 5. Please use English; otherwise, the issue will be closed.
Describe the bug
[2025-06-18 18:11:03] Prefill batch. #new-seq: 14, #new-token: 8192, #cached-token: 0, token usage: 0.92, #unbootstrapped-req: 0, #queue-req: 105, #transferring-req: 990, input throughput (token/s): 96875.08
[2025-06-18 18:11:03] Prefill batch. #new-seq: 18, #new-token: 8192, #cached-token: 0, token usage: 0.94, #unbootstrapped-req: 0, #queue-req: 88, #transferring-req: 1004, input throughput (token/s): 98205.90
[2025-06-18 18:11:03] Prefill batch. #new-seq: 13, #new-token: 8192, #cached-token: 0, token usage: 0.95, #unbootstrapped-req: 0, #queue-req: 76, #transferring-req: 1015, input throughput (token/s): 85849.56
[2025-06-18 18:11:03] Prefill batch. #new-seq: 18, #new-token: 8192, #cached-token: 0, token usage: 0.96, #unbootstrapped-req: 0, #queue-req: 59, #transferring-req: 1029, input throughput (token/s): 105483.90
[2025-06-18 18:11:04] Prefill batch. #new-seq: 18, #new-token: 8192, #cached-token: 0, token usage: 0.98, #unbootstrapped-req: 0, #queue-req: 42, #transferring-req: 1041, input throughput (token/s): 100693.93
[2025-06-18 18:11:04] Prefill batch. #new-seq: 13, #new-token: 5614, #cached-token: 0, token usage: 0.99, #unbootstrapped-req: 0, #queue-req: 30, #transferring-req: 1056, input throughput (token/s): 86617.45
[2025-06-18 18:11:04] Prefill out of memory. Try to lower your batch size.
Try to allocate 5614 tokens.
Available tokens: 5952
self.token_to_kv_pool_allocator.available_size()=5952
self.tree_cache.evictable_size()=0
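The numbers above look contradictory at first glance: only 5614 tokens are requested while 5952 are available, yet the allocation fails. With paged KV-cache allocation, however, each request's slots are rounded up to a page boundary, so a conservative allocator must budget for the worst-case page-aligned demand, which can exceed the raw token count. A minimal sketch of that arithmetic (the page size of 64 and the exact form of the check are illustrative assumptions, not sglang's actual implementation):

```python
def worst_case_paged_need(num_tokens: int, num_reqs: int, page_size: int) -> int:
    """Upper bound on KV-cache slots a paged allocator may need:
    each of the num_reqs requests can waste up to page_size - 1
    slots when its allocation is rounded up to a full page."""
    return num_tokens + num_reqs * (page_size - 1)

# Values from the log above: 13 new sequences, 5614 new tokens,
# 5952 free slots reported by token_to_kv_pool_allocator.
need = worst_case_paged_need(5614, 13, 64)
print(need)         # 6433
print(need > 5952)  # True: the page-aligned bound exceeds the free slots,
                    # so a conservative check rejects the batch even though
                    # the raw token count fits
```

If this is indeed the mechanism, the batch is rejected not because the pool is truly exhausted but because the page-rounding safety margin cannot be guaranteed, and with `tree_cache.evictable_size()=0` (all KV caches pinned waiting for transfer) nothing can be evicted to make room.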
[2025-06-18 18:11:04] Scheduler hit an exception: Traceback (most recent call last):
File "/root/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2616, in run_scheduler_process
scheduler.event_loop_overlap_disagg_prefill()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/workspace/sglang/python/sglang/srt/disaggregation/prefill.py", line 313, in event_loop_overlap_disagg_prefill
batch = self.get_new_batch_prefill()
File "/root/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1587, in get_new_batch_prefill
new_batch.prepare_for_extend()
File "/root/workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1260, in prepare_for_extend
out_cache_loc = self.alloc_paged_token_slots_extend(
File "/root/workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1016, in alloc_paged_token_slots_extend
raise RuntimeError(error_msg)
RuntimeError: Prefill out of memory. Try to lower your batch size.
Try to allocate 5614 tokens.
Available tokens: 5952
self.token_to_kv_pool_allocator.available_size()=5952
self.tree_cache.evictable_size()=0
Reproduction
Run a 1P1D setup with the llama-3.2-3B model and send a batch of requests that exceeds the server's capacity. In addition, manually slow down the KV cache transfer to simulate a heavy workload, so that many KV caches queue up waiting to be transferred.
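The overload half of the reproduction can be sketched as a client that fires a burst of long prompts at the server all at once, so the prefill queue grows faster than KV caches can be transferred to the decode side. The endpoint path, payload shape, request count, and prompt length below are illustrative assumptions, not part of the original report; adjust them to your deployment:

```python
import concurrent.futures
import json
import urllib.request

def build_payload(prompt_tokens: int) -> dict:
    # Hypothetical payload: a long filler prompt to maximize prefill work.
    return {
        "text": "hello " * prompt_tokens,
        "sampling_params": {"max_new_tokens": 64},
    }

def send_one(url: str, payload: dict) -> int:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def flood(url: str, n_requests: int, prompt_tokens: int) -> None:
    # Fire all requests concurrently so the scheduler's queue fills up
    # while earlier KV caches are still waiting to be transferred.
    payloads = [build_payload(prompt_tokens) for _ in range(n_requests)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_requests) as pool:
        list(pool.map(lambda p: send_one(url, p), payloads))

if __name__ == "__main__":
    # Assumed sglang-style /generate endpoint on the default port.
    flood("http://127.0.0.1:30000/generate", n_requests=1200, prompt_tokens=600)
```

Combined with an artificially slowed KV transfer on the prefill side, this should drive `#transferring-req` up and `evictable_size()` to 0, matching the log above.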
Environment