[Bug] Deepseek-v2-lite AMD MI300 run failed #2384

@BruceXcluding

Description

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.
  3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  5. Please use English; otherwise the issue will be closed.

Describe the bug

DeepSeek-V2-Lite on ROCm (AMD MI300, tp=8) fails during CUDA graph capture with a Triton `OutOfResources` error: the grouped decode-attention kernel requires 131072 bytes of shared memory, but the hardware limit is 65536 bytes.

Server log:

WARNING 12-07 02:43:18 rocm.py:17] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
[2024-12-07 02:43:23] server_args=ServerArgs(model_path='/data/deepseek-v2-lite/', tokenizer_path='/data/deepseek-v2-lite/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/data/deepseek-v2-lite/', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.81, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, tp_size=8, stream_interval=1, random_seed=179983669, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2024-12-07 02:43:32 TP4] Process 3010 gpu_id 4 is running on CPUs: [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155]
[2024-12-07 02:43:33 TP4] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP4] Init torch distributed begin.
[2024-12-07 02:43:33 TP0] Process 3006 gpu_id 0 is running on CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107]
[2024-12-07 02:43:33 TP1] Process 3007 gpu_id 1 is running on CPUs: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119]
[2024-12-07 02:43:33 TP5] Process 3011 gpu_id 5 is running on CPUs: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167]
[2024-12-07 02:43:33 TP7] Process 3139 gpu_id 7 is running on CPUs: [84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191]
[2024-12-07 02:43:33 TP6] Process 3075 gpu_id 6 is running on CPUs: [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179]
[2024-12-07 02:43:33 TP3] Process 3009 gpu_id 3 is running on CPUs: [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143]
[2024-12-07 02:43:33 TP1] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP1] Init torch distributed begin.
[2024-12-07 02:43:33 TP5] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP5] Init torch distributed begin.
[2024-12-07 02:43:33 TP0] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP0] Init torch distributed begin.
[2024-12-07 02:43:33 TP2] Process 3008 gpu_id 2 is running on CPUs: [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131]
[2024-12-07 02:43:33 TP7] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP7] Init torch distributed begin.
[2024-12-07 02:43:33 TP6] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP6] Init torch distributed begin.
[2024-12-07 02:43:33 TP3] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP3] Init torch distributed begin.
[2024-12-07 02:43:33 TP2] MLA optimization is turned on. Use triton backend.
[2024-12-07 02:43:33 TP2] Init torch distributed begin.
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
INFO 12-07 02:43:34 pynccl_wrapper.py:188] Found nccl from library librccl.so.1
[2024-12-07 02:43:36 TP4] Load weight begin. avail mem=185.83 GB
[2024-12-07 02:43:36 TP7] Load weight begin. avail mem=185.83 GB
[2024-12-07 02:43:36 TP0] Load weight begin. avail mem=184.31 GB
[2024-12-07 02:43:36 TP5] Load weight begin. avail mem=185.80 GB
[2024-12-07 02:43:36 TP6] Load weight begin. avail mem=185.70 GB
[2024-12-07 02:43:36 TP3] Load weight begin. avail mem=185.54 GB
[2024-12-07 02:43:36 TP2] Load weight begin. avail mem=186.38 GB
[2024-12-07 02:43:36 TP1] Load weight begin. avail mem=185.55 GB
[2024-12-07 02:43:36 TP7] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP4] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP7] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP4] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP3] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP6] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP3] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP6] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP5] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP5] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP2] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP2] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-12-07 02:43:36 TP7] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP4] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP5] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP6] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP3] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP0] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP2] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP1] Skipping import of cpp extensions
[2024-12-07 02:43:36 TP7] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP4] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP5] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP6] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP3] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP0] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP2] lm_eval is not installed, GPTQ may not be usable
[2024-12-07 02:43:36 TP1] lm_eval is not installed, GPTQ may not be usable
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [03:52<11:38, 232.83s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [07:58<08:01, 240.54s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [09:31<02:53, 173.19s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [10:55<00:00, 137.94s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [10:55<00:00, 163.93s/it]

[2024-12-07 02:54:33 TP2] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.61 GB
[2024-12-07 02:54:33 TP7] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.06 GB
[2024-12-07 02:54:33 TP0] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=180.54 GB
[2024-12-07 02:54:33 TP4] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.06 GB
[2024-12-07 02:54:33 TP5] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=182.04 GB
[2024-12-07 02:54:33 TP1] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=181.79 GB
[2024-12-07 02:54:33 TP6] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=181.93 GB
[2024-12-07 02:54:33 TP3] Load weight end. type=DeepseekV2ForCausalLM, dtype=torch.bfloat16, avail mem=181.77 GB
[2024-12-07 02:54:33 TP5] Memory pool end. avail mem=34.51 GB
[2024-12-07 02:54:33 TP7] Memory pool end. avail mem=34.54 GB
[2024-12-07 02:54:33 TP3] Memory pool end. avail mem=34.25 GB
[2024-12-07 02:54:33 TP4] Memory pool end. avail mem=34.54 GB
[2024-12-07 02:54:33 TP6] Memory pool end. avail mem=34.41 GB
[2024-12-07 02:54:33 TP1] Memory pool end. avail mem=34.26 GB
[2024-12-07 02:54:33 TP0] Memory pool end. avail mem=33.02 GB
[2024-12-07 02:54:33 TP2] Memory pool end. avail mem=35.09 GB
[2024-12-07 02:54:35 TP1] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP2] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP6] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP7] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP4] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP3] Capture cuda graph begin. This can take up to several minutes.
[2024-12-07 02:54:35 TP5] Capture cuda graph begin. This can take up to several minutes.
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
INFO 12-07 02:54:42 custom_all_reduce.py:260] Registering 0 cuda graph addresses
[2024-12-07 02:54:42 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1493, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 191, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 62, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in __init__
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 631, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 207, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 268, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 332, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 325, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 823, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 784, in forward
    hidden_states, residual = layer(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 722, in forward
    hidden_states = self.self_attn(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 493, in forward
    return self.forward_absorb(positions, hidden_states, forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 579, in forward_absorb
    attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 58, in forward
    return forward_batch.attn_backend.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/__init__.py", line 59, in forward
    return self.forward_decode(q, k, v, layer, forward_batch, save_kv_cache)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 181, in forward_decode
    self.decode_attention_fwd(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py", line 701, in decode_attention_fwd
    decode_attention_fwd_grouped(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py", line 656, in decode_attention_fwd_grouped
    _decode_grouped_softmax_reducev_fwd(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py", line 567, in _decode_grouped_softmax_reducev_fwd
    _fwd_grouped_kernel_stage2[grid](
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/runtime/jit.py", line 687, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 392, in __getattribute__
    self._init_handles()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 385, in _init_handles
    raise OutOfResources(self.metadata.shared, max_shared, "shared memory")
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 131072, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.

[The identical OutOfResources traceback was raised at 02:54:42 on the remaining ranks as well: TP0, TP1, TP7, TP6, TP5, TP3, and TP2.]

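For context, the 65536-byte "Hardware limit" in the traceback is the per-workgroup LDS (shared memory) on MI300's CDNA3 compute units, while the grouped stage-2 decode kernel was compiled to need 131072 bytes. The limit can be confirmed with the same driver query Triton's compiler.py runs in `_init_handles`; this is a minimal diagnostic sketch assuming the Triton 3.0 runtime driver API shown in the traceback:

from triton.runtime import driver

# Same query that triton/compiler/compiler.py performs before launching a kernel.
device = driver.active.get_current_device()
props = driver.active.utils.get_device_properties(device)
print("max_shared_mem:", props["max_shared_mem"])  # 65536 on MI300, matching the error above
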
Reproduction

python -m sglang.launch_server \
         --model-path /data/deepseek-v2-lite/ \
         --dp 1 \
         --tp 8 \
         --trust-remote-code

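The hint at the end of the traceback ("Reducing block sizes or `num_stages` may help.") refers to Triton launch options: the required 131072 bytes is exactly twice the 65536-byte limit, which is consistent with a pipelining depth of 2 double-buffering the staged data. The toy sketch below only illustrates how those options are passed when launching a jitted kernel; the kernel body and the values are hypothetical, not the actual `_fwd_grouped_kernel_stage2` configuration in decode_attention.py.

import torch
import triton
import triton.language as tl

# Hypothetical kernel, used only to show the launch-time options the hint mentions.
@triton.jit
def _copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 256),)
# BLOCK, num_warps, and num_stages are passed as launch-time options; for kernels
# that stage tiles in shared memory (like the attention kernels), smaller values
# are what the OutOfResources hint is asking for.
_copy_kernel[grid](x, y, x.numel(), BLOCK=256, num_warps=4, num_stages=1)
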
Environment

Docker image: henryx/haisgl:sgl0.3.2_vllm0.6.0_torch2.5_rocm6.2_triton3.0.0
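
In addition to the docker tag, the exact library versions inside the container can be printed with standard torch/triton attributes (a small sketch, nothing sglang-specific):

import torch, triton

print("torch:", torch.__version__)
print("hip:", torch.version.hip)            # ROCm/HIP runtime bundled with torch
print("triton:", triton.__version__)
print("device:", torch.cuda.get_device_name(0))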
