Conversation

chromecast56
Contributor

Motivation

Add support for EAGLE-3: https://arxiv.org/abs/2503.01840

Modifications

  • Add EAGLE3 speculative method to server args
  • Refactor llama.py, logits_processor.py to support capturing auxiliary hidden states
  • Modify eagle_worker.py to support EAGLE-3 token map + untied LM head
  • Add llama_eagle3.py model
  • Tests and Documentation

Checklist

@zhyncs
Member

zhyncs commented Mar 10, 2025

cc @Liyuhui-12 @hongyanz

@chromecast56
Contributor Author

Benchmarks on MT-Bench (bsz 1):

Autoregressive:


python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000 --cuda-graph-max-bs 1

#questions: 1, Throughput: 147.05 token/s, Acceptance length: 1.00

EAGLE-3:

python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE3 \
    --speculative-draft jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
    --cuda-graph-max-bs 1 --mem-fraction 0.7 --dtype float16 --port 30000

#questions: 80, Throughput: 336.61 token/s, Acceptance length: 4.29
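As a quick sanity check, the two reported throughputs (copied from the runs above) imply roughly a 2.3x end-to-end speedup for EAGLE-3 at bsz 1:

```python
# Speedup implied by the two MT-Bench runs above (bsz 1).
baseline_tps = 147.05  # autoregressive throughput (tokens/s)
eagle3_tps = 336.61    # EAGLE-3 throughput (tokens/s)

speedup = eagle3_tps / baseline_tps
print(f"EAGLE-3 speedup: {speedup:.2f}x")  # EAGLE-3 speedup: 2.29x
```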

@zhyncs
Member

zhyncs commented Mar 10, 2025

@chromecast56 Can we use --speculative-num-steps 8?

@chromecast56
Contributor Author

chromecast56 commented Mar 11, 2025

Acceptance length is calculated as num_output_tokens / num_verify_ct over the entire dataset, so it should take that into account. @hongyanz Do you normalize acceptance length per prompt, or is it over the entire dataset?

@hongyanz

hongyanz commented Mar 12, 2025

@chromecast56 Ours is at the round level. We calculate the number of accepted tokens in each speculative decoding round, add 1 to it (because the very last token in each round is always accepted, since it comes from the target model), and average the numbers across all rounds.
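To make the two definitions concrete, here is a minimal sketch (the per-round counts are made up for illustration, and the function names are not from the SGLang code). For a single sequence the two metrics coincide, since each verify round emits the accepted draft tokens plus one bonus token from the target model; they only diverge if you average per prompt rather than over all rounds pooled together:

```python
def dataset_level(num_output_tokens, num_verify_ct):
    # Dataset-level: total output tokens divided by total verify rounds.
    return num_output_tokens / num_verify_ct

def round_level(accepted_per_round):
    # Round-level: accepted draft tokens in each round, plus 1 for the
    # bonus token from the target model, averaged across rounds.
    return sum(a + 1 for a in accepted_per_round) / len(accepted_per_round)

rounds = [3, 4, 2, 5]                      # accepted draft tokens per round
total_tokens = sum(a + 1 for a in rounds)  # 18 output tokens over 4 rounds

print(dataset_level(total_tokens, len(rounds)))  # 4.5
print(round_level(rounds))                       # 4.5
```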

Collaborator

@ispobock ispobock left a comment

LGTM

@zhaochenyang20
Collaborator

@chromecast56 could you fix the conflicts in the docs? @zhyncs @merrymercy could we merge it?

@zhyncs
Member

zhyncs commented Mar 13, 2025

@merrymercy @Ying1123 reminder

@zhyncs
Member

zhyncs commented Mar 17, 2025

Hi @chromecast56, could you help fix the conflicts?

Contributor

@merrymercy merrymercy left a comment

LGTM! Thanks for the great work.

@zhyncs zhyncs merged commit 9e0186f into sgl-project:main Mar 18, 2025
34 of 36 checks passed
@zhaochenyang20
Collaborator

cc @simveit hey Simon. EAGLE-3 is merged into sglang now; Yineng @zhyncs will profile it today. Could you help update the docs https://docs.sglang.ai/backend/speculative_decoding.html after Yineng provides the performance numbers? Thanks so much!

@simveit
Contributor

simveit commented Mar 18, 2025

@zhaochenyang20 Yes. Let me read the paper in the next few days.

@finger92
Contributor

should we change
"You can enable EAGLE-3 decoding by setting --speculative_draft_model_path: EAGLE3:"
to
"You can enable EAGLE-3 decoding by setting --speculative-algorithm EAGLE3:"
?

@zhyncs zhyncs mentioned this pull request Mar 22, 2025
@merrymercy
Contributor

@ispobock can you share the exact commands to launch the servers for eagle2 and eagle3?
We do not need to benchmark on MT-Bench; we can just use python3 -m sglang.test.send_one

@ispobock
Collaborator

@merrymercy For bs=1, the launch commands are here:

# EAGLE2
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE \
    --speculative-draft jamesliu1/sglang-EAGLE-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
    --cuda-graph-max-bs 1 --dtype float16 --port 30000 --tp 1 --disable-radix --mem-frac 0.7

# EAGLE3
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE3 \
    --speculative-draft jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 8 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
    --cuda-graph-max-bs 1 --dtype float16 --port 30000 --tp 1 --disable-radix --mem-frac 0.7

MT-Bench is the default benchmark dataset in EAGLE's evaluation code; it's used to stay aligned with the setting in the paper.
python3 -m sglang.test.send_one also works for benchmarking.

@ryang-max
Contributor

I'll also work on the docs for this feature in the coming days.
cc @zhaochenyang20 @simveit
