
@ispobock (Collaborator) commented Feb 14, 2025

Motivation

We implemented NextN (MTP) speculative decoding for DeepSeek-V3/R1 based on EAGLE-2 on the Triton backend (#3466) and achieved a 1.76x speedup with CUDA Graph and Torch.compile compatibility. In the current benchmark, we achieved 77 token/s output throughput at batch size 1.

In our implementation, we only use the single MTP module (NextN layer) from the official model checkpoint. We found that it can also be used for autoregressive drafting like EAGLE. The accept rate of the MTP module is very high (~1.9 average accept length when drafting 2 tokens, e.g. --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2). We therefore use it to draft more tokens and achieve a better speedup (2.5~3 average accept length when drafting 4 tokens over 2 steps, e.g. --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4).

Best practices should be further investigated through additional experiments, as predicting more tokens can increase overhead and impact throughput, especially for large batch sizes. A careful trade-off between latency and throughput is necessary to determine the optimal number of speculative tokens.
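For intuition on this trade-off, here is a minimal back-of-the-envelope sketch (not part of the PR): it estimates the decoding speedup from the average accept length and an assumed relative cost of one draft-and-verify iteration. The cost values below are illustrative assumptions, not measurements.

# Back-of-the-envelope speedup estimate for speculative decoding (sketch, not SGLang code).
# Assumption: one draft-and-verify iteration costs `relative_step_cost` times a plain
# decode forward pass, and every iteration emits `avg_accept_length` tokens on average.

def estimated_speedup(avg_accept_length: float, relative_step_cost: float) -> float:
    """Tokens per unit time with speculation, relative to plain autoregressive decoding."""
    return avg_accept_length / relative_step_cost

# Roughly matching the accept lengths reported above (~1.9 with a 2-token draft,
# ~2.5-3.0 with a 4-token draft); the overhead factors are made-up examples.
print(estimated_speedup(1.9, relative_step_cost=1.15))  # ~1.65x
print(estimated_speedup(2.6, relative_step_cost=1.45))  # ~1.79x

The point of the sketch: drafting more tokens raises the accept length, but it also raises the per-iteration cost, so the net speedup can flatten or even reverse, especially at larger batch sizes.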

Benchmark Results

# benchmark
python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 256 --output-len 256

# baseline on main branch
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --trust-remote --tp 8

batch size: 1
latency: 6.70 s
output throughput: 38.19 token/s
(input + output) throughput: 76.39 token/s

# w/ nextn speculative decoding
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /sgl-workspace/DeepSeek-V3-nextn --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --trust-remote --tp 8

batch size: 1
latency: 3.77 s
output throughput: 67.93 token/s
(input + output) throughput: 135.87 token/s

# w/ nextn speculative decoding + Torch.compile
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /sgl-workspace/DeepSeek-V3-nextn --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8 --enable-torch-compile --torch-compile-max-bs 1

batch size: 1
latency: 3.29 s
output throughput: 77.73 token/s
(input + output) throughput: 155.45 token/s
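For convenience, a small snippet (mine, not part of the PR) that recomputes the speedup ratios implied by the output-throughput numbers above:

# Quick arithmetic on the batch-size-1 results reported above (256-in/256-out).
baseline = 38.19        # token/s, main branch
nextn = 67.93           # token/s, with NextN speculative decoding
nextn_compile = 77.73   # token/s, NextN + Torch.compile

print(f"NextN speedup:                 {nextn / baseline:.2f}x")          # ~1.78x
print(f"NextN + Torch.compile speedup: {nextn_compile / baseline:.2f}x")  # ~2.04x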

Usage

Option 1: Export NextN weights manually

  1. Export the weights of the NextN layer with the script scripts/export_deepseek_nextn.py:
python3 export_deepseek_nextn.py --input-dir /path/to/DeepSeek-V3 --output-dir /path/to/DeepSeek-V3-NextN
  2. Use the NextN layer as the draft model and launch the server:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /path/to/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8

Option 2: Use the exported NextN weights directly

Ref: #3582 (comment)

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8
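To sanity-check a launched server, here is a small smoke test (my own example, not from the PR) that hits the server's native /generate endpoint; the endpoint name and payload follow SGLang's documented native API, but adjust them if your version differs. While it runs, the server log should report "accept len" values, which indicate whether NextN drafting is active.

# Minimal smoke test against a locally launched server (assumed to listen on port 30000).
import requests

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 32},
    },
    timeout=120,
)
resp.raise_for_status()
# Print the generated continuation; check the server log for "accept len: ..." lines,
# i.e. the average number of draft tokens accepted per verify step.
print(resp.json()["text"])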

@zhyncs zhyncs merged commit 862dd76 into sgl-project:main Feb 14, 2025
2 of 17 checks passed
@freeliuzc commented:

Great work!
Regarding the tokens for Draft 1, what is the average accepted length?
Thanks

@Swipe4057 (Contributor) commented:

Could you please clarify if I understand correctly that speculative decoding does not increase throughput, and even decreases it under high load? How can I properly find the optimal load point?

@lambert0312 (Contributor) commented:

I wonder if MTP supports bf16?

@zhyncs (Member) commented Feb 15, 2025

FYI you can use these checkpoints for V3 NextN and R1 NextN instead of exporting them yourself. Cheers!

https://huggingface.co/SGLang/DeepSeek-V3-NextN
https://huggingface.co/SGLang/DeepSeek-R1-NextN

@lambert0312 (Contributor) commented:

  1. I use the bf16 model and export the weights of the NextN layer:
    python3 export_deepseek_nextn.py --input-dir /path/to/DeepSeek-V3-bf16 --output-dir /path/to/DeepSeek-V3-NextN-bf16

  2. Use the NextN layer as the draft model and launch the server:
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-bf16 --speculative-algo NEXTN --speculative-draft /path/to/DeepSeek-V3-NextN-bf16 --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8

  3. The log is as follows:
    [2025-02-15 06:17:26 TP3] Scheduler hit an exception: Traceback (most recent call last):
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 80, in init
    self.capture()
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 101, in capture
    CudaGraphRunner.capture(self)
    File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 304, in capture
    ) = self.capture_one_batch_size(bs, forward)
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 164, in capture_one_batch_size
    run_once()
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 154, in run_once
    ret = self.eagle_worker.draft_forward(forward_batch)
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 260, in draft_forward
    logits_output = self.model_runner.model.forward(
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 140, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 96, in forward
    hidden_states, residual = self.decoder(
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 770, in forward
    hidden_states = self.self_attn(
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 528, in forward
    return self.forward_absorb(positions, hidden_states, forward_batch)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 620, in forward_absorb
    attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 67, in forward
    return forward_batch.attn_backend.forward(
    File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/init.py", line 67, in forward
    return self.forward_decode(q, k, v, layer, forward_batch, save_kv_cache)
    File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 441, in forward_decode
    forward_batch.token_to_kv_pool.set_kv_buffer(
    File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 288, in set_kv_buffer
    self.k_buffer[layer_id][loc] = cache_k
    RuntimeError: shape mismatch: value tensor of shape [4, 1, 576] cannot be broadcast to indexing result of shape [4, 4, 56]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 252, in init
self.draft_worker = EAGLEWorker(
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 99, in init
self.init_cuda_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 110, in init_cuda_graphs
self.cuda_graph_runner = EAGLEDraftCudaGraphRunner(self)
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 82, in init
raise Exception(
Exception: Capture cuda graph failed: shape mismatch: value tensor of shape [4, 1, 576] cannot be broadcast to indexing result of shape [4, 4, 56]
Possible solutions:

disable cuda graph by --disable-cuda-graph
set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
disable torch compile by not using --enable-torch-compile
specify --dtype to the same dtype (e.g. bfloat16)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

@ispobock (Collaborator, Author) commented:

@lambert0312 Which bf16 model did you use and what GPU did you use? It seems the checkpoint is not correct. Maybe you can try to convert it with this guide.

@ispobock (Collaborator, Author) commented:

Regarding the tokens for Draft 1, what is the average accepted length?

Currently --speculative-num-steps is at least 2. We will support a single draft step in a following update. I think the accepted length can match the result in the paper.

@ispobock (Collaborator, Author) commented:

Could you please clarify if I understand correctly that speculative decoding does not increase throughput, and even decreases it under high load?

Speculative decoding can give a speedup for small batch sizes but is not designed for high load. That said, I think the NextN method can still get a speedup at larger batch sizes, since its higher accept rate lets us use fewer draft steps and draft tokens to get good performance.

How can I properly find the optimal load point?

Maybe you can run the benchmark with different request rates and check the throughput; see the sketch below.
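As one way to do such a sweep (my own sketch, not from the PR), you could run sglang.bench_serving at several request rates and compare the reported output throughput. The flags used here are assumptions based on the usual bench_serving interface; check --help for your installed version.

# Hypothetical sweep over request rates using sglang.bench_serving (flags assumed;
# verify with `python3 -m sglang.bench_serving --help` for your version).
import subprocess

for rate in [1, 2, 4, 8, 16]:
    print(f"=== request rate {rate} req/s ===")
    subprocess.run(
        [
            "python3", "-m", "sglang.bench_serving",
            "--backend", "sglang",
            "--host", "127.0.0.1", "--port", "30000",
            "--num-prompts", "200",
            "--request-rate", str(rate),
        ],
        check=True,
    )
# Compare the reported output token throughput across rates; the optimal load point is
# roughly where throughput stops scaling while latency starts to grow sharply.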

@lambert0312 (Contributor) commented Feb 16, 2025

@lambert0312 Which bf16 model did you use and what GPU did you use? It seems the checkpoint is not correct. Maybe you can try to convert it with this guide.

I use 4x A800 GPUs and converted the bf16 MTP NextN model. @ispobock

@YosanHo commented Feb 20, 2025

I used the latest code and got an error on 8*H20:

python -m sglang.launch_server --model-path /opt/model/DeepSeek-R1 --trust-remote-code --served-model-name deepseek-r1 --enable-metrics --speculative-algo NEXTN --speculative-draft /opt/model/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.9 --tp 8

[2025-02-20 10:32:44 TP7] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 252, in init
self.draft_worker = EAGLEWorker(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 47, in init
super().init(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in init
self.model_runner = ModelRunner(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 187, in init
min_per_gpu_memory = self.init_torch_distributed()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 280, in init_torch_distributed
raise ValueError(
ValueError: The memory capacity is unbalanced. Some GPUs may be occupied by other processes.

@lambert0312 (Contributor) commented:

@YosanHo Maybe you need to adjust the --mem-fraction-static parameter.

@cermeng (Contributor) commented Feb 20, 2025

I ran the benchmark provided by @ispobock on 2 nodes of 8*H800, but MTP speculative decoding is much slower than normal decoding. I'm not sure if this is expected.

# mtp
batch size: 1
latency: 13.54 s
output throughput: 18.91 token/s
(input + output) throughput: 37.82 token/s

# w/o mtp(normal)
batch size: 1
latency: 8.53 s
output throughput: 30.02 token/s
(input + output) throughput: 60.04 token/s

@caseylai commented:

I benchmarked NextN on 2 nodes of 8*H20 for R1 and got up to 200% or more higher throughput.

batch size 1: from 17 t/s to 52 t/s
batch size 30: from 160 t/s to 500 t/s

But strangely, the speed does not stay high; it drops slowly. In the beginning it was 500 t/s, and over 2-3 hours it dropped roughly linearly to 150 t/s or less.

My start command is:
python3 -m sglang.launch_server --model-path /mnt/disk01/model/deepseek/DeepSeek-R1 --host 0.0.0.0 --port 8000 --tp 16 --nccl-init $master_node:7749 --nnodes 2 --node-rank $node_rank --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8 --speculative-algo NEXTN --speculative-draft /mnt/disk01/model/deepseek/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix

@lishicheng1996 commented:

(quoting @caseylai's benchmark results above)

Hi, may I ask which version of SGLang you used and the accept length in your test? I use 0.4.3.post2; while MTP gives double the speed at bs=1, the speed is almost the same at bs=8.

@caseylai commented:

(quoting the question above about the SGLang version and accept length)

0.4.3.post2, same as you. I don't know what accept length is; you can see all the arguments in my command.

@lishicheng1996 commented:

(quoting the exchange above)

Thanks very much for your reply! You can see the accept length in the SGLang logs. It is the number of accepted tokens among the draft tokens, and it determines the speed gain of MTP. In my test the accept length is about 2.3.

@Zhou-sx (Contributor) commented Feb 21, 2025

@lambert0312 Did you succeed? I'm trying to deploy on 8*H20, too.

@lambert0312 (Contributor) commented:

do you succeed? I'm trying to deploy on 8*H20, too.

@Zhou-sx Sorry, I just saw the message. I have already started running on 4 A800 nodes. However, our scenario is a long context. Currently, chunked_prefill is turned off in NEXTN mode, so OOM often occurs.

@Zhou-sx (Contributor) commented Feb 21, 2025

do you succeed? I'm trying to deploy on 8*H20, too.

@Zhou-sx Sorry, I just saw the message. I have already started running on 4 A800 nodes. However, our scenario is a long context. Currently, chunked_prefill is turned off in NEXTN mode, so OOM often occurs.

thanks.

@Zhou-sx (Contributor) commented Feb 21, 2025

(quoting @YosanHo's "memory capacity is unbalanced" error on 8*H20 above)

Did you succeed?

@victorserbu2709 commented:

When I try to run on 2 nodes of 8x H100 using the docker image lmsysorg/sglang:v0.4.3.post2-cu125-srt:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 16 --dist-init-addr 172.16.1.68:5000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0  --enable-cache-report --enable-metrics --watchdog-timeout=3000 --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix 

it gets stuck at

0%| | 0/34 [00:00<?, ?it/s][2025-02-21 12:51:48 TP6] Capture cuda graph begin. This can take up to several minutes.

If I add --disable-cuda-graph it starts, but the output throughput is only ~15 token/s:

[2025-02-21 13:11:34 TP0] Decode batch. #running-req: 1, #token: 1435, token usage: 0.00, accept len: 2.15, gen throughput (token/s): 14.30, #queue-req: 0
[2025-02-21 13:11:40 TP0] Decode batch. #running-req: 1, #token: 1525, token usage: 0.01, accept len: 2.25, gen throughput (token/s): 15.06, #queue-req: 0

If I run with

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 16 --dist-init-addr 172.16.1.68:5000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0  --enable-cache-report --enable-metrics  --enable-flashinfer-mla  --watchdog-timeout=3000

it obtains ~30 output tokens/s:

[2025-02-21 13:34:54 TP0] Decode batch. #running-req: 1, #token: 184, token usage: 0.00, gen throughput (token/s): 29.53, #queue-req: 0

@yuqie
Copy link

yuqie commented Feb 22, 2025

Hi, is NextN compatible with bench_one_batch? I tried DeepSeek R1 on 8*H200 with python3 -m sglang.bench_one_batch --trust-remote-code --run-name DeepSeekR1 --model-path /mnt/model/ --batch-size 2 --speculative-algo NEXTN --speculative-draft /mnt/huggingface/DeepSeek-R1-NextN/ --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --input-len 1000 --output-len 1 --tensor-parallel-size 8 --disable-radix and encountered the "tensor size does not match" error as follows:

max_total_num_tokens=480079
Warmup ...
[2025-02-22 01:58:01 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
(same line repeated for TP1-TP7)
Prefill. latency: 8.30952 s, throughput:    240.69 token/s
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 432, in latency_test
    latency_test_run_once(
  File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 370, in latency_test_run_once
    next_token_ids, _ = decode(next_token_ids, batch, model_runner)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 254, in decode
    logits_output = model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 791, in forward
    return self.cuda_graph_runner.replay(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 423, in replay
    self.input_ids[:raw_num_token].copy_(forward_batch.input_ids)
RuntimeError: The size of tensor a (8) must match the size of tensor b (2) at non-singleton dimension 0

@jifa513 commented Feb 22, 2025

(quoting @YosanHo's "memory capacity is unbalanced" error on 8*H20 above)

I have the same problem on 8*H200.

@YosanHo commented Feb 22, 2025

(replying to @lambert0312's suggestion to adjust mem-fraction-static)

I got it running with --mem-fraction-static 0.87 and by modifying the source code in model_runner.py (line 280) to skip the validation, but the performance is very poor.

@ehuaa (Contributor) commented Feb 23, 2025

Hi @lambert0312, have you fixed this problem on the 4*A100 nodes? I met this problem too.

Not yet, trying @ehuaa

(quoting @lambert0312's earlier reply about running on 4 A800 nodes with chunked_prefill turned off in NEXTN mode)

Hi @lambert0312, how did you fix the problem on the 4*A800 nodes? I'm still stuck here. Is it caused by chunked_prefill?

@Zhou-sx (Contributor) commented Feb 24, 2025

(quoting @YosanHo's workaround with --mem-fraction-static 0.87)

Why does modifying mem-fraction-static solve the problem of unbalanced memory capacity?

@lambert0312 (Contributor) commented:

Hi @lambert0312, how did you fix the problem on the 4*A800 nodes? I'm still stuck here. Is it caused by chunked_prefill?

@ehuaa What version are you using?

@kimlee1874 commented Feb 26, 2025

I did a benchmark test with bench_serving.py on 2 x 8 x H800, and here is my startup script with MTP (0.4.3.post2):
python -m sglang.launch_server --model-path ./DeepSeek-R1/ --tp 16 --dist-init-addr $IP_PORT --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --speculative-algo NEXTN --speculative-draft ./DeepSeek-R1-NextN/ --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.75

A very strange phenomenon is:

  1. When isl/osl = 1k/1k, the speedup from MTP is 1.6x (bs 1) and 1.4x (bs 8)
  2. But when isl is increased to 8k, MTP shows almost no speedup even at bs 1, and starts to show negative gains at bs 16

Why does MTP become less effective when isl becomes longer?


@RonanKMcGovern commented Feb 26, 2025

(quoting @victorserbu2709's run with --enable-flashinfer-mla that obtains ~30 output tokens/s)

That's pretty interesting; where did you get the idea to use flashinfer-mla? Shouldn't that be automatic, as shown by "MLA optimization is turned on. Use triton backend." in the logs?

@jokerwyt (Contributor) commented Feb 27, 2025

       parser.add_argument(
            "--speculative-num-steps",
            type=int,
            help="The number of steps sampled from draft model in Speculative Decoding.",
            default=ServerArgs.speculative_num_steps,
        )
        parser.add_argument(
            "--speculative-num-draft-tokens",
            type=int,
            help="The number of token sampled from draft model in Speculative Decoding.",
            default=ServerArgs.speculative_num_draft_tokens,
        )

These two parameters are confusing to me. What are their specific meanings? Why can't it be as simple as vLLM, which has only one parameter, --num_speculative_tokens, specifying how many tokens to predict?

@pipul My guess is that speculative-num-steps indicates how many times you run the draft model forward (each time you select the top-k tree paths from root to leaf and get k new nodes), and speculative-num-draft-tokens represents the number of nodes in the draft tree, according to the EAGLE-2 paper. A rough sketch of this interpretation is below.
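To make that interpretation concrete, here is a rough sketch of how the parameters might interact. It reflects my reading of the EAGLE-2 scheme, not the actual SGLang implementation, and the expansion rule is simplified.

# Sketch of the parameter interpretation above (NOT the actual SGLang code):
# each draft step expands the tree frontier by top-k candidates, and
# speculative-num-draft-tokens caps how many tree nodes are kept for verification.

def draft_tree_budget(num_steps: int, eagle_topk: int, num_draft_tokens: int) -> int:
    """Number of draft-tree nodes sent for verification under this simplified model."""
    nodes = 0
    frontier = 1  # start from the last verified token (tree root)
    for _ in range(num_steps):
        frontier *= eagle_topk  # each frontier node proposes top-k children
        nodes += frontier
    return min(nodes, num_draft_tokens)

# With the settings used in this PR (--speculative-num-steps 2 --speculative-eagle-topk 4
# --speculative-num-draft-tokens 4), up to 4 + 16 = 20 candidate nodes are generated in
# this simplified model, but only 4 are kept and verified by the target model.
print(draft_tree_budget(2, 4, 4))  # -> 4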

@jokerwyt (Contributor) commented:

(quoting @RonanKMcGovern's question about --enable-flashinfer-mla and the "MLA optimization is turned on. Use triton backend." log line)

I saw that log, and I think it is telling me that the Triton backend is used, instead of FlashInfer 😂.

@seanxcwang commented:

(quoting @lambert0312's note that chunked_prefill is turned off in NEXTN mode)

Why is chunked_prefill disabled in NEXTN mode? I also got this error.

@jokerwyt (Contributor) commented Mar 13, 2025

(profiling screenshot) I expect the time spent on the "verify" part to be close to a normal decode forward pass (less than 100 ms; my setting is bs=16 and ctx=12k), but it currently takes about 400 ms. It slows down my output throughput severely. Does this seem like a kernel performance issue?

The commit I tested:
commit 4a05bdf (gh/main)
Author: Lianmin Zheng lianminzheng@gmail.com
Date: Sun Mar 9 18:53:33 2025 -0700

Revert "Check eagle server args" (#4242)

@ZJLi2013 commented:

BTW, is there an easy way to visualize the draft tree once it is built, for debugging?

@parambole commented:

Hey @ispobock & @zhyncs, I am currently working on integrating DeepSeek's Multi-Token Prediction into MaxText.

Question:

As part of this PR, has the team been able to load the open MTP DeepSeek-V3 weights and analyze the implementation during pre-training and fine-tuning? I am curious about any observed behavior.
