[Feature] Support EAGLE 3 #4247
Conversation
Benchmarks on MT-Bench (bsz 1): Autoregressive:
EAGLE-3:
@chromecast56 Can we use
Acceptance length is calculated as
@chromecast56 Ours is at the round level. We count the number of accepted tokens in each speculative decoding round, add 1 to it (the very last token in each round is always accepted, since it comes from the target model), and average the numbers across all rounds.
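The round-level metric described above can be sketched in a few lines. This is an illustrative helper, not code from the PR; the function name and input shape are assumptions.

```python
# Sketch of the round-level acceptance-length metric described above:
# count accepted draft tokens per speculative round, add 1 for the bonus
# token from the target model, then average across rounds.
def acceptance_length(accepted_per_round):
    """accepted_per_round: list of accepted draft-token counts, one per round."""
    if not accepted_per_round:
        return 0.0
    return sum(n + 1 for n in accepted_per_round) / len(accepted_per_round)

print(acceptance_length([3, 2, 4]))  # -> 4.0
```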
LGTM
@chromecast56 Could you fix the conflicts in the docs? @zhyncs @merrymercy Could we merge it?
@merrymercy @Ying1123 reminder
Hi @chromecast56, could you help fix the conflicts?
LGTM! Thanks for the great work.
cc @simveit Hey Simon. EAGLE-3 is merged into sglang now; Yineng (@zhyncs) will profile it today. Could you help update the docs at https://docs.sglang.ai/backend/speculative_decoding.html after Yineng provides the performance numbers? Thanks so much!
@zhaochenyang20 Yes. Let me read the paper in the next few days.
should we change
@ispobock Can you share the exact commands to launch the servers for EAGLE-2 and EAGLE-3?
@merrymercy For bs=1, the launch commands are here:

```shell
# EAGLE2
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE \
  --speculative-draft jamesliu1/sglang-EAGLE-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
  --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
  --cuda-graph-max-bs 1 --dtype float16 --port 30000 --tp 1 --disable-radix --mem-frac 0.7

# EAGLE3
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE3 \
  --speculative-draft jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 8 \
  --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
  --cuda-graph-max-bs 1 --dtype float16 --port 30000 --tp 1 --disable-radix --mem-frac 0.7
```

MT-Bench is the default benchmark dataset in EAGLE's evaluation code; it is used to stay aligned with the setting in the paper.
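Once either server is up, it can be queried over HTTP. A minimal client sketch follows, assuming SGLang's native `/generate` endpoint with `text` and `sampling_params` fields; the helper names here are illustrative, not from this thread.

```python
import json
from urllib import request

def build_payload(prompt, max_new_tokens=128):
    # Payload shape assumed from SGLang's native HTTP API.
    return json.dumps({
        "text": prompt,
        "sampling_params": {"temperature": 0.0, "max_new_tokens": max_new_tokens},
    }).encode()

def generate(prompt, url="http://localhost:30000/generate"):
    # Send the prompt to the locally launched server and return the completion.
    req = request.Request(url, data=build_payload(prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]

# Usage (requires one of the servers above to be running):
# print(generate("Explain speculative decoding in one sentence."))
```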
I'll also work on the docs for this feature in the coming days.
Motivation
Add support for EAGLE-3: https://arxiv.org/abs/2503.01840
Modifications
- Add `EAGLE3` speculative method to server args
- `llama.py`, `logits_processor.py`: support capturing auxiliary hidden states
- `eagle_worker.py`: support EAGLE-3 token map + untied LM head
- `llama_eagle3.py`: add EAGLE-3 draft model

Checklist
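The "token map" mentioned in the modifications can be illustrated with a toy example: in EAGLE-3 the draft head may score a reduced draft vocabulary, and a mapping translates draft-vocab ids back to target-model vocab ids before verification. This sketch uses assumed names and a toy 4-token vocabulary, not the PR's actual implementation.

```python
# Toy illustration of a draft-to-target token map: token_map[i] gives the
# target-model vocab id for draft-vocab id i.
def map_draft_to_target(draft_ids, token_map):
    return [token_map[i] for i in draft_ids]

token_map = [0, 11, 42, 7]  # assumed 4-token draft vocabulary
print(map_draft_to_target([2, 1], token_map))  # -> [42, 11]
```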