
[Bug] Start fails if memory saver is enabled #4985

@kebe7jun

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submit lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, which reduces the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

(sglang-dev) ubuntu@g-worker-102:~/kebe/sglang$ PYTHONPATH=python python python/sglang/launch_server.py --model ../DeepSeek-R1-Distill-Qwen-32B/ --trust-remote-code  --served-model-name ds --tensor-parallel-size 4 --dtype bfloat16 --max-running-requests 100  --enable-memory-saver
INFO 04-02 13:31:02 __init__.py:190] Automatically detected platform cuda.
enable_memory_saver is enabled, but torch-memory-saver is not installed. Please install it via `pip3 uninstall torch-memory-saver`.
[2025-04-02 13:31:04] server_args=ServerArgs(model_path='../DeepSeek-R1-Distill-Qwen-32B/', tokenizer_path='../DeepSeek-R1-Distill-Qwen-32B/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='bfloat16', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='ds', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.85, max_running_requests=100, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=4, stream_interval=1, stream_output=False, random_seed=792507887, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode=None, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=80, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=True, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, enable_flashinfer_mla=False, enable_flashmla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998)
Traceback (most recent call last):
  File "/home/ubuntu/kebe/sglang/python/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/home/ubuntu/kebe/sglang/python/sglang/srt/entrypoints/http_server.py", line 679, in launch_server
    tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
  File "/home/ubuntu/kebe/sglang/python/sglang/srt/entrypoints/engine.py", line 494, in _launch_subprocesses
    with memory_saver_adapter.configure_subprocess():
  File "/home/ubuntu/kebe/sglang/python/sglang/srt/torch_memory_saver_adapter.py", line 48, in configure_subprocess
    return torch_memory_saver.configure_subprocess()
NameError: name 'torch_memory_saver' is not defined
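
For context, the traceback suggests a guarded optional import: when torch-memory-saver is absent, the module-level import fails and is only logged (see the warning above), so the name `torch_memory_saver` is never bound, and the later call site raises NameError. A minimal sketch of that failure pattern follows; the structure and import form are assumptions for illustration, not the actual sglang adapter code.

try:
    import torch_memory_saver  # optional dependency (hypothetical import form)
except ImportError:
    # Hypothetical: the failure is only printed, so `torch_memory_saver`
    # is left unbound in this module.
    print("enable_memory_saver is enabled, but torch-memory-saver is not installed.")

class TorchMemorySaverAdapter:
    def configure_subprocess(self):
        # With the package missing, `torch_memory_saver` is an unbound name
        # here, so this raises NameError instead of a helpful ImportError.
        return torch_memory_saver.configure_subprocess()

TorchMemorySaverAdapter().configure_subprocess()
# -> NameError: name 'torch_memory_saver' is not defined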

Reproduction

See above.
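
A possible fix, sketched under the same assumption about the guarded import (names are illustrative, not the actual sglang code), is to bind a sentinel on import failure and fail fast with an actionable message. Note the install hint should read `pip3 install`, not `pip3 uninstall` as in the log above.

try:
    import torch_memory_saver  # hypothetical import form
except ImportError:
    torch_memory_saver = None  # sentinel instead of an unbound name

class TorchMemorySaverAdapter:
    def configure_subprocess(self):
        if torch_memory_saver is None:
            # Fail fast with the correct install command.
            raise ImportError(
                "enable_memory_saver requires torch-memory-saver; "
                "install it via `pip3 install torch-memory-saver`."
            )
        return torch_memory_saver.configure_subprocess()

With a guard like this, launching with --enable-memory-saver but without the package would fail with a clear ImportError instead of the NameError shown in the traceback.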

Environment

Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 4090
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda-12.4
NVCC: Cuda compilation tools, release 12.4, V12.4.99
CUDA Driver Version: 550.54.14
PyTorch: 2.5.1+cu124
sglang: 0.4.4.post3
sgl_kernel: 0.0.6
flashinfer: Module Not Found
triton: 3.1.0
transformers: 4.50.0
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.14
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.29.3
interegular: 0.3.3
modelscope: 1.24.0
orjson: 3.10.16
outlines: 0.1.11
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
xgrammar: 0.1.17
openai: 1.68.2
tiktoken: 0.9.0
anthropic: 0.49.0
litellm: 1.63.14
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     PIX   PXB   PXB   SYS   SYS   SYS   SYS   PXB   PXB   0-19,40-59    0              N/A
GPU1  PIX   X     PXB   PXB   SYS   SYS   SYS   SYS   PXB   PXB   0-19,40-59    0              N/A
GPU2  PXB   PXB   X     PXB   SYS   SYS   SYS   SYS   PIX   PIX   0-19,40-59    0              N/A
GPU3  PXB   PXB   PXB   X     SYS   SYS   SYS   SYS   PXB   PXB   0-19,40-59    0              N/A
GPU4  SYS   SYS   SYS   SYS   X     PIX   PXB   PXB   SYS   SYS   20-39,60-79   1              N/A
GPU5  SYS   SYS   SYS   SYS   PIX   X     PXB   PXB   SYS   SYS   20-39,60-79   1              N/A
GPU6  SYS   SYS   SYS   SYS   PXB   PXB   X     PXB   SYS   SYS   20-39,60-79   1              N/A
GPU7  SYS   SYS   SYS   SYS   PXB   PXB   PXB   X     SYS   SYS   20-39,60-79   1              N/A
NIC0  PXB   PXB   PIX   PXB   SYS   SYS   SYS   SYS   X     PIX
NIC1  PXB   PXB   PIX   PXB   SYS   SYS   SYS   SYS   PIX   X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1

ulimit soft: 1024
