
@ispobock (Collaborator) commented Feb 14, 2025

Motivation

We implemented NextN (MTP) speculative decoding for DeepSeek-V3/R1 based on EAGLE-2 on the Triton backend (#3466) and achieved a 1.76x speedup with CUDA Graph and Torch.compile compatibility. In the current benchmark, we achieved 77 token/s output throughput at batch size 1.

In our implementation, we only use the single MTP module (NextN layer) from the official model checkpoint. We found that it can also be used for autoregressive drafting like EAGLE. The accept rate of the MTP module is very high (~1.9 average accept length when drafting 2 tokens, e.g. --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2). We therefore use it to draft more tokens and achieve a better speedup (2.5~3 average accept length when drafting 4 tokens over 2 steps, e.g. --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4).

Best practices should be further investigated through additional experiments, as predicting more tokens can increase overhead and impact throughput, especially for large batch sizes. A careful trade-off between latency and throughput is necessary to determine the optimal number of speculative tokens.
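For intuition on this trade-off, here is a minimal back-of-the-envelope sketch (not part of the PR): it estimates the decoding speedup from the average accept length and an assumed relative cost of one draft-and-verify iteration. The cost values below are illustrative assumptions, not measurements.

# Back-of-the-envelope speedup estimate for speculative decoding (sketch, not SGLang code).
# Assumption: one draft-and-verify iteration costs `relative_step_cost` times a plain
# decode forward pass, and every iteration emits `avg_accept_length` tokens on average.

def estimated_speedup(avg_accept_length: float, relative_step_cost: float) -> float:
    """Tokens per unit time with speculation, relative to plain autoregressive decoding."""
    return avg_accept_length / relative_step_cost

# Roughly matching the accept lengths reported above (~1.9 with a 2-token draft,
# ~2.5-3.0 with a 4-token draft); the overhead factors are made-up examples.
print(estimated_speedup(1.9, relative_step_cost=1.15))  # ~1.65x
print(estimated_speedup(2.6, relative_step_cost=1.45))  # ~1.79x

The point of the sketch: drafting more tokens raises the accept length, but it also raises the per-iteration cost, so the net speedup can flatten or even reverse, especially at larger batch sizes.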

Benchmark Results

# benchmark
python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 256 --output-len 256

# baseline on main branch
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --trust-remote --tp 8

batch size: 1
latency: 6.70 s
output throughput: 38.19 token/s
(input + output) throughput: 76.39 token/s

# w/ nextn speculative decoding
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /sgl-workspace/DeepSeek-V3-nextn --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --trust-remote --tp 8

batch size: 1
latency: 3.77 s
output throughput: 67.93 token/s
(input + output) throughput: 135.87 token/s

# w/ nextn speculative decoding + Torch.compile
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /sgl-workspace/DeepSeek-V3-nextn --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8 --enable-torch-compile --torch-compile-max-bs 1

batch size: 1
latency: 3.29 s
output throughput: 77.73 token/s
(input + output) throughput: 155.45 token/s
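For convenience, a small snippet (mine, not part of the PR) that recomputes the speedup ratios implied by the output-throughput numbers above:

# Quick arithmetic on the batch-size-1 results reported above (256-in/256-out).
baseline = 38.19        # token/s, main branch
nextn = 67.93           # token/s, with NextN speculative decoding
nextn_compile = 77.73   # token/s, NextN + Torch.compile

print(f"NextN speedup:                 {nextn / baseline:.2f}x")          # ~1.78x
print(f"NextN + Torch.compile speedup: {nextn_compile / baseline:.2f}x")  # ~2.04x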

Usage

Option 1: Export NextN weights manually

  1. Export the weights of the NextN layer with the script scripts/export_deepseek_nextn.py:
python3 export_deepseek_nextn.py --input-dir /path/to/DeepSeek-V3 --output-dir /path/to/DeepSeek-V3-NextN
  2. Use the NextN layer as the draft model and launch the server:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft /path/to/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8

Option 2: Use the exported NextN weights directly

Ref: #3582 (comment)

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8
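To sanity-check a launched server, here is a small smoke test (my own example, not from the PR) that hits the server's native /generate endpoint; the endpoint name and payload follow SGLang's documented native API, but adjust them if your version differs. While it runs, the server log should report "accept len" values, which indicate whether NextN drafting is active.

# Minimal smoke test against a locally launched server (assumed to listen on port 30000).
import requests

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 32},
    },
    timeout=120,
)
resp.raise_for_status()
# Print the generated continuation; check the server log for "accept len: ..." lines,
# i.e. the average number of draft tokens accepted per verify step.
print(resp.json()["text"])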

@zhyncs zhyncs merged commit 862dd76 into sgl-project:main Feb 14, 2025
2 of 17 checks passed
@freeliuzc commented:

Great work!
Regarding the tokens for Draft 1, what is the average accepted length?
Thanks

@Swipe4057 (Contributor) commented:

Could you please clarify if I understand correctly that speculative decoding does not increase throughput, and even decreases it under high load? How can I properly find the optimal load point?

@lambert0312 (Contributor) commented:

I wonder if MTP supports bf16?

@zhyncs (Member) commented Feb 15, 2025

FYI you can use these checkpoints for V3 NextN and R1 NextN instead of exporting them yourself. Cheers!

https://huggingface.co/SGLang/DeepSeek-V3-NextN
https://huggingface.co/SGLang/DeepSeek-R1-NextN

@lambert0312 (Contributor) commented:

  1. I use the bf16 model and export the weights of the NextN layer:
    python3 export_deepseek_nextn.py --input-dir /path/to/DeepSeek-V3-bf16 --output-dir /path/to/DeepSeek-V3-NextN-bf16

  2. Use the NextN layer as the draft model and launch the server:
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-bf16 --speculative-algo NEXTN --speculative-draft /path/to/DeepSeek-V3-NextN-bf16 --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --tp 8

  3. The log is as follows:
    [2025-02-15 06:17:26 TP3] Scheduler hit an exception: Traceback (most recent call last):
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 80, in init
    self.capture()
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 101, in capture
    CudaGraphRunner.capture(self)
    File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 304, in capture
    ) = self.capture_one_batch_size(bs, forward)
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 164, in capture_one_batch_size
    run_once()
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 154, in run_once
    ret = self.eagle_worker.draft_forward(forward_batch)
    File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 260, in draft_forward
    logits_output = self.model_runner.model.forward(
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 140, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_nextn.py", line 96, in forward
    hidden_states, residual = self.decoder(
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 770, in forward
    hidden_states = self.self_attn(
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 528, in forward
    return self.forward_absorb(positions, hidden_states, forward_batch)
    File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 620, in forward_absorb
    attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
    File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 67, in forward
    return forward_batch.attn_backend.forward(
    File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/init.py", line 67, in forward
    return self.forward_decode(q, k, v, layer, forward_batch, save_kv_cache)
    File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 441, in forward_decode
    forward_batch.token_to_kv_pool.set_kv_buffer(
    File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 288, in set_kv_buffer
    self.k_buffer[layer_id][loc] = cache_k
    RuntimeError: shape mismatch: value tensor of shape [4, 1, 576] cannot be broadcast to indexing result of shape [4, 4, 56]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 252, in init
self.draft_worker = EAGLEWorker(
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 99, in init
self.init_cuda_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 110, in init_cuda_graphs
self.cuda_graph_runner = EAGLEDraftCudaGraphRunner(self)
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 82, in init
raise Exception(
Exception: Capture cuda graph failed: shape mismatch: value tensor of shape [4, 1, 576] cannot be broadcast to indexing result of shape [4, 4, 56]
Possible solutions:

disable cuda graph by --disable-cuda-graph
set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
disable torch compile by not using --enable-torch-compile
specify --dtype to the same dtype (e.g. bfloat16)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

@ispobock (Collaborator, Author) commented:

@lambert0312 Which bf16 model did you use and what GPU did you use? It seems the checkpoint is not correct. Maybe you can try to convert it with this guide.

@ispobock (Collaborator, Author) commented:

Regarding the tokens for Draft 1, what is the average accepted length?

Currently --speculative-num-steps is at least 2. We will support a single draft step in a following update. I think the accepted length can match the result in the paper.

@ispobock (Collaborator, Author) commented:

Could you please clarify if I understand correctly that speculative decoding does not increase throughput, and even decreases it under high load?

Speculative decoding can give a speedup for small batch sizes but is not designed for high load. That said, I think the NextN method can still get a speedup at larger batch sizes, since its higher accept rate lets us use fewer draft steps and draft tokens to get good performance.

How can I properly find the optimal load point?

Maybe you can run the benchmark with different request rates and check the throughput; see the sketch below.
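As one way to do such a sweep (my own sketch, not from the PR), you could run sglang.bench_serving at several request rates and compare the reported output throughput. The flags used here are assumptions based on the usual bench_serving interface; check --help for your installed version.

# Hypothetical sweep over request rates using sglang.bench_serving (flags assumed;
# verify with `python3 -m sglang.bench_serving --help` for your version).
import subprocess

for rate in [1, 2, 4, 8, 16]:
    print(f"=== request rate {rate} req/s ===")
    subprocess.run(
        [
            "python3", "-m", "sglang.bench_serving",
            "--backend", "sglang",
            "--host", "127.0.0.1", "--port", "30000",
            "--num-prompts", "200",
            "--request-rate", str(rate),
        ],
        check=True,
    )
# Compare the reported output token throughput across rates; the optimal load point is
# roughly where throughput stops scaling while latency starts to grow sharply.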

@lambert0312 (Contributor) commented Feb 16, 2025

@lambert0312 Which bf16 model did you use and what GPU did you use? It seems the checkpoint is not correct. Maybe you can try to convert it with this guide.

I use 4x A800 GPUs and converted the bf16 MTP NextN model. @ispobock

@YosanHo commented Feb 20, 2025

I used the latest code and got an error on 8*H20:

python -m sglang.launch_server --model-path /opt/model/DeepSeek-R1 --trust-remote-code --served-model-name deepseek-r1 --enable-metrics --speculative-algo NEXTN --speculative-draft /opt/model/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.9 --tp 8

[2025-02-20 10:32:44 TP7] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 252, in init
self.draft_worker = EAGLEWorker(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/speculative/eagle_worker.py", line 47, in init
super().init(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in init
self.model_runner = ModelRunner(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 187, in init
min_per_gpu_memory = self.init_torch_distributed()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 280, in init_torch_distributed
raise ValueError(
ValueError: The memory capacity is unbalanced. Some GPUs may be occupied by other processes.

@lambert0312 (Contributor) commented:

@YosanHo Maybe you need to adjust the --mem-fraction-static parameter.

@cermeng (Contributor) commented Feb 20, 2025

I ran the benchmark provided by @ispobock on 2 nodes of 8*H800, but MTP speculative decoding is much slower than normal decoding. I'm not sure if this is expected.

# mtp
batch size: 1
latency: 13.54 s
output throughput: 18.91 token/s
(input + output) throughput: 37.82 token/s

# w/o mtp(normal)
batch size: 1
latency: 8.53 s
output throughput: 30.02 token/s
(input + output) throughput: 60.04 token/s

@caseylai commented:

I benchmarked NextN on 2 nodes of 8*H20 for R1 and got up to 200% or more higher throughput.

batch size 1: from 17 t/s to 52 t/s
batch size 30: from 160 t/s to 500 t/s

But strangely, the speed does not stay high; it drops slowly. In the beginning it was 500 t/s, and over 2-3 hours it dropped roughly linearly to 150 t/s or less.

My start command is:
python3 -m sglang.launch_server --model-path /mnt/disk01/model/deepseek/DeepSeek-R1 --host 0.0.0.0 --port 8000 --tp 16 --nccl-init $master_node:7749 --nnodes 2 --node-rank $node_rank --trust-remote-code --enable-torch-compile --torch-compile-max-bs 8 --speculative-algo NEXTN --speculative-draft /mnt/disk01/model/deepseek/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix

@lishicheng1996 commented:

(quoting @caseylai's benchmark results above)

Hi, may I ask which version of SGLang you used and the accept length in your test? I use 0.4.3.post2; while MTP gives double the speed at bs=1, the speed is almost the same at bs=8.

@caseylai commented:

(quoting the question above about the SGLang version and accept length)

0.4.3.post2, same as you. I don't know what accept length is; you can see all the arguments in my command.

@lishicheng1996 commented:

(quoting the exchange above)

Thanks very much for your reply! You can see the accept length in the SGLang logs. It is the number of accepted tokens among the draft tokens, and it determines the speed gain of MTP. In my test the accept length is about 2.3.

@Zhou-sx (Contributor) commented Feb 21, 2025

@lambert0312 Did you succeed? I'm trying to deploy on 8*H20, too.

@lambert0312 (Contributor) commented:

do you succeed? I'm trying to deploy on 8*H20, too.

@Zhou-sx Sorry, I just saw the message. I have already started running on 4 A800 nodes. However, our scenario is a long context. Currently, chunked_prefill is turned off in NEXTN mode, so OOM often occurs.

@Zhou-sx (Contributor) commented Feb 21, 2025

do you succeed? I'm trying to deploy on 8*H20, too.

@Zhou-sx Sorry, I just saw the message. I have already started running on 4 A800 nodes. However, our scenario is a long context. Currently, chunked_prefill is turned off in NEXTN mode, so OOM often occurs.

thanks.

@Zhou-sx (Contributor) commented Feb 21, 2025

(quoting @YosanHo's "memory capacity is unbalanced" error on 8*H20 above)

Did you succeed?

@victorserbu2709 commented:

When I try to run on 2 nodes of 8x H100 using the docker image lmsysorg/sglang:v0.4.3.post2-cu125-srt:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 16 --dist-init-addr 172.16.1.68:5000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0  --enable-cache-report --enable-metrics --watchdog-timeout=3000 --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix 

it gets stuck at

0%| | 0/34 [00:00<?, ?it/s][2025-02-21 12:51:48 TP6] Capture cuda graph begin. This can take up to several minutes.

If I add --disable-cuda-graph it starts, but the output throughput is only ~15 token/s:

[2025-02-21 13:11:34 TP0] Decode batch. #running-req: 1, #token: 1435, token usage: 0.00, accept len: 2.15, gen throughput (token/s): 14.30, #queue-req: 0
[2025-02-21 13:11:40 TP0] Decode batch. #running-req: 1, #token: 1525, token usage: 0.01, accept len: 2.25, gen throughput (token/s): 15.06, #queue-req: 0

If I run with

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 16 --dist-init-addr 172.16.1.68:5000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0  --enable-cache-report --enable-metrics  --enable-flashinfer-mla  --watchdog-timeout=3000

it obtains ~30 output tokens/s:

[2025-02-21 13:34:54 TP0] Decode batch. #running-req: 1, #token: 184, token usage: 0.00, gen throughput (token/s): 29.53, #queue-req: 0

@yuqie
Copy link

yuqie commented Feb 22, 2025

Hi, is NextN compatible with bench_one_batch? I tried DeepSeek R1 on 8*H200 with python3 -m sglang.bench_one_batch --trust-remote-code --run-name DeepSeekR1 --model-path /mnt/model/ --batch-size 2 --speculative-algo NEXTN --speculative-draft /mnt/huggingface/DeepSeek-R1-NextN/ --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --input-len 1000 --output-len 1 --tensor-parallel-size 8 --disable-radix and encountered the "tensor size does not match" error as follows:

max_total_num_tokens=480079
Warmup ...
[2025-02-22 01:58:01 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
(same line repeated for TP1-TP7)
Prefill. latency: 8.30952 s, throughput:    240.69 token/s
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 432, in latency_test
    latency_test_run_once(
  File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 370, in latency_test_run_once
    next_token_ids, _ = decode(next_token_ids, batch, model_runner)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 254, in decode
    logits_output = model_runner.forward(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 791, in forward
    return self.cuda_graph_runner.replay(forward_batch)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 423, in replay
    self.input_ids[:raw_num_token].copy_(forward_batch.input_ids)
RuntimeError: The size of tensor a (8) must match the size of tensor b (2) at non-singleton dimension 0

@jifa513 commented Feb 22, 2025

(quoting @YosanHo's "memory capacity is unbalanced" error on 8*H20 above)

I have the same problem on 8*H200.

@YosanHo commented Feb 22, 2025

(replying to @lambert0312's suggestion to adjust mem-fraction-static)

I got it running with --mem-fraction-static 0.87 and by modifying the source code in model_runner.py (line 280) to skip the validation, but the performance is very poor.

@ehuaa (Contributor) commented Feb 23, 2025

Hi @lambert0312, have you fixed this problem on the 4*A100 nodes? I met this problem too.

Not yet, trying @ehuaa

(quoting @lambert0312's earlier reply about running on 4 A800 nodes with chunked_prefill turned off in NEXTN mode)

Hi @lambert0312, how did you fix the problem on the 4*A800 nodes? I'm still stuck here. Is it caused by chunked_prefill?

@Zhou-sx (Contributor) commented Feb 24, 2025

(quoting @YosanHo's workaround with --mem-fraction-static 0.87)

Why does modifying mem-fraction-static solve the problem of unbalanced memory capacity?

@lambert0312 (Contributor) commented:

Hi @lambert0312, how did you fix the problem on the 4*A800 nodes? I'm still stuck here. Is it caused by chunked_prefill?

@ehuaa What version are you using?

@kimlee1874 commented Feb 26, 2025

I did a benchmark test with bench_serving.py on 2 x 8 x H800, and here is my startup script with MTP (0.4.3.post2):
python -m sglang.launch_server --model-path ./DeepSeek-R1/ --tp 16 --dist-init-addr $IP_PORT --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --speculative-algo NEXTN --speculative-draft ./DeepSeek-R1-NextN/ --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --mem-fraction-static 0.75

A very strange phenomenon is:

  1. When isl/osl = 1k/1k, the speedup from MTP is 1.6x (bs 1) and 1.4x (bs 8)
  2. But when isl is increased to 8k, MTP shows almost no speedup even at bs 1, and starts to show negative gains at bs 16

Why does MTP become less effective when isl becomes longer?


@RonanKMcGovern commented Feb 26, 2025

(quoting @victorserbu2709's run with --enable-flashinfer-mla that obtains ~30 output tokens/s)

That's pretty interesting; where did you get the idea to use flashinfer-mla? Shouldn't that be automatic, as shown by "MLA optimization is turned on. Use triton backend." in the logs?

@jokerwyt (Contributor) commented Feb 27, 2025

       parser.add_argument(
            "--speculative-num-steps",
            type=int,
            help="The number of steps sampled from draft model in Speculative Decoding.",
            default=ServerArgs.speculative_num_steps,
        )
        parser.add_argument(
            "--speculative-num-draft-tokens",
            type=int,
            help="The number of token sampled from draft model in Speculative Decoding.",
            default=ServerArgs.speculative_num_draft_tokens,
        )

These two parameters are confusing to me. What are their specific meanings? Why can't it be as simple as vLLM, which has only one parameter, --num_speculative_tokens, specifying how many tokens to predict?

@pipul My guess is that speculative-num-steps indicates how many times you run the draft model forward (each time you select the top-k tree paths from root to leaf and get k new nodes), and speculative-num-draft-tokens represents the number of nodes in the draft tree, according to the EAGLE-2 paper. A rough sketch of this interpretation is below.
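To make that interpretation concrete, here is a rough sketch of how the parameters might interact. It reflects my reading of the EAGLE-2 scheme, not the actual SGLang implementation, and the expansion rule is simplified.

# Sketch of the parameter interpretation above (NOT the actual SGLang code):
# each draft step expands the tree frontier by top-k candidates, and
# speculative-num-draft-tokens caps how many tree nodes are kept for verification.

def draft_tree_budget(num_steps: int, eagle_topk: int, num_draft_tokens: int) -> int:
    """Number of draft-tree nodes sent for verification under this simplified model."""
    nodes = 0
    frontier = 1  # start from the last verified token (tree root)
    for _ in range(num_steps):
        frontier *= eagle_topk  # each frontier node proposes top-k children
        nodes += frontier
    return min(nodes, num_draft_tokens)

# With the settings used in this PR (--speculative-num-steps 2 --speculative-eagle-topk 4
# --speculative-num-draft-tokens 4), up to 4 + 16 = 20 candidate nodes are generated in
# this simplified model, but only 4 are kept and verified by the target model.
print(draft_tree_budget(2, 4, 4))  # -> 4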

@jokerwyt (Contributor) commented:

(quoting @RonanKMcGovern's question about --enable-flashinfer-mla and the "MLA optimization is turned on. Use triton backend." log line)

I saw that log, and I think it is telling me that the Triton backend is used, instead of FlashInfer 😂.

@seanxcwang commented:

(quoting @lambert0312's note that chunked_prefill is turned off in NEXTN mode)

Why is chunked_prefill disabled in NEXTN mode? I also got this error.

@jokerwyt (Contributor) commented Mar 13, 2025

(profiling screenshot) I expect the time spent on the "verify" part to be close to a normal decode forward pass (less than 100 ms; my setting is bs=16 and ctx=12k), but it currently takes about 400 ms. It slows down my output throughput severely. Does this seem like a kernel performance issue?

The commit I tested:
commit 4a05bdf (gh/main)
Author: Lianmin Zheng lianminzheng@gmail.com
Date: Sun Mar 9 18:53:33 2025 -0700

Revert "Check eagle server args" (#4242)

@ZJLi2013 commented:

BTW, is there an easy way to visualize the draft tree once it is built, for debugging?

@parambole commented:

Hey @ispobock & @zhyncs, I am currently working on integrating DeepSeek's Multi-Token Prediction into MaxText.

Question:

As part of this PR, has the team been able to load the open MTP DeepSeek-V3 weights and analyze the implementation during pre-training and fine-tuning? I am curious about any observed behavior.
