@ByronHsu commented May 21, 2025

Motivation

Co-authored-by: SangBin Cho rkooo567@gmail.com

Support PD (prefill–decode disaggregation) combined with speculative decoding. However, with Llama the decode engine still crashes under high concurrency; this needs further investigation.

# prefill
$ python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --disaggregation-mode prefill --disaggregation-ib-device mlx5_roce0
# decode
$ python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algorithm EAGLE \
    --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 \
    --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --cuda-graph-max-bs 8 --disaggregation-mode decode --disaggregation-ib-device mlx5_roce1 --base-gpu-id 1 --port 30001
# load balancer
$ python3 -m sglang.srt.disaggregation.mini_lb --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
# client
$ python few_shot_gsm8k.py --port 8000

Error

  File "/root/submodules/sglang/python/sglang/srt/speculative/eagle_worker.py", line 419, in draft
    score_list, token_list, parents_list = self.draft_forward(forward_batch)
  File "/root/submodules/sglang/python/sglang/srt/speculative/eagle_worker.py", line 478, in draft_forward
    logits_output = self.draft_model_runner.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/submodules/sglang/python/sglang/srt/models/llama.py", line 457, in forward
    hidden_states = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/submodules/sglang/python/sglang/srt/models/llama_eagle.py", line 97, in forward
    hidden_states = self.fc(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
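The traceback points to a precision mismatch inside the draft model's `fc` layer: the incoming hidden states are float32 while the layer's weights are half precision, so `F.linear` refuses the pair. A minimal standalone sketch of the failure mode (stock PyTorch only, not the sglang code itself; tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Hidden states arrive in fp32 while the layer weights are fp16,
# mirroring the Float/Half mismatch in the traceback above.
x = torch.randn(2, 4, dtype=torch.float32)   # activations (fp32)
w = torch.randn(8, 4, dtype=torch.float16)   # layer weights (fp16)

try:
    F.linear(x, w)
except RuntimeError as e:
    print(f"RuntimeError: {e}")

# Casting both operands to a common dtype resolves the mismatch.
out = F.linear(x, w.to(x.dtype))
print(out.dtype)  # torch.float32
```

In the actual fix, the cast would go the other way (activations down to the model dtype) to keep the draft model in half precision; the sketch only demonstrates why the two dtypes must agree.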

Modifications

Checklist

@zhyncs zhyncs merged commit d2e0881 into main May 23, 2025
18 of 61 checks passed
@zhyncs zhyncs deleted the byron/pd-spec branch May 23, 2025 19:03
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
2 participants