@ByronHsu commented May 21, 2025

Motivation

Co-authored-by: SangBin Cho rkooo567@gmail.com

Support PD (prefill–decode disaggregation) combined with speculative decoding. However, with Llama the decode engine still crashes under high concurrency; this needs further investigation.

# prefill
$ python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --disaggregation-mode prefill --disaggregation-ib-device mlx5_roce0
# decode
$ python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --disaggregation-mode decode --disaggregation-ib-device mlx5_roce1 --base-gpu-id 1 --port 30001
# load balancer
$ python3 -m sglang.srt.disaggregation.mini_lb --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
# client
$ python few_shot_gsm8k.py --port 8000

Error

  File "/root/submodules/sglang/python/sglang/srt/speculative/eagle_worker.py", line 419, in draft
    score_list, token_list, parents_list = self.draft_forward(forward_batch)
  File "/root/submodules/sglang/python/sglang/srt/speculative/eagle_worker.py", line 478, in draft_forward
    logits_output = self.draft_model_runner.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/submodules/sglang/python/sglang/srt/models/llama.py", line 457, in forward
    hidden_states = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/submodules/sglang/python/sglang/srt/models/llama_eagle.py", line 97, in forward
    hidden_states = self.fc(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
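The traceback points to a precision mismatch inside the draft model's `fc` layer: the incoming hidden states are float32 while the layer's weights are half precision, so `F.linear` refuses the pair. A minimal standalone sketch of the failure mode (stock PyTorch only, not the sglang code itself; tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Hidden states arrive in fp32 while the layer weights are fp16,
# mirroring the Float/Half mismatch in the traceback above.
x = torch.randn(2, 4, dtype=torch.float32)   # activations (fp32)
w = torch.randn(8, 4, dtype=torch.float16)   # layer weights (fp16)

try:
    F.linear(x, w)
except RuntimeError as e:
    print(f"RuntimeError: {e}")

# Casting both operands to a common dtype resolves the mismatch.
out = F.linear(x, w.to(x.dtype))
print(out.dtype)  # torch.float32
```

In the actual fix, the cast would go the other way (activations down to the model dtype) to keep the draft model in half precision; the sketch only demonstrates why the two dtypes must agree.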

Modifications

Checklist

@zhyncs zhyncs merged commit d2e0881 into main May 23, 2025
18 of 61 checks passed
@zhyncs zhyncs deleted the byron/pd-spec branch May 23, 2025 19:03
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
2 participants