Support EAGLE draft extend CUDA graph #6606

Conversation
From the results, performance is better than main at batch sizes 1 and 4, but worse at batch sizes 16 and 32.
```python
from sglang.srt.speculative.eagle_worker import EAGLEWorker


class EAGLEDraftExtendCudaGraphRunner:
```
BTW, when enabling the draft extend CUDA graph, we should also adjust mem_fraction_static to avoid running out of memory. For example, with speculative settings 5/4/8 (num-steps / eagle-topk / num-draft-tokens):
```shell
python3 -m sglang.launch_server --model /dev/shm/DeepSeek-V3-0324 --tp 8 --trust-remote-code \
    --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 \
    --speculative-algorithm EAGLE

lm_eval --model local-chat-completions \
    --model_args model=/dev/shm/DeepSeek-V3-0324,base_url=http://127.0.0.1:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
    --tasks gsm8k --batch_size 128 --apply_chat_template --num_fewshot 8
```
This configuration will OOM.
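A minimal sketch of the workaround, assuming the standard `--mem-fraction-static` launch flag; the value 0.7 is illustrative, not a tuned recommendation:

```shell
# Lower the static memory fraction to leave headroom for the extra
# draft extend CUDA graphs (0.7 is an illustrative value, not tuned).
python3 -m sglang.launch_server --model /dev/shm/DeepSeek-V3-0324 --tp 8 --trust-remote-code \
    --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 \
    --speculative-algorithm EAGLE --mem-fraction-static 0.7
```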
Performance improved after 9055a49.
There is an accept-rate issue at large batch sizes. It should be fixed before merging.
/gemini review
Code Review
This pull request introduces CUDA graph support for EAGLE draft extend, which is a significant step towards improving performance for speculative decoding. The changes look promising, with performance gains demonstrated in the PR description.
I've identified a couple of areas for improvement, primarily related to maintainability and a potential correctness issue in attention backend initialization. Overall, the new `EAGLEDraftExtendCudaGraphRunner` class seems well-structured for its purpose.
Summary of Findings
- Magic Numbers for Memory Reservation: In `python/sglang/srt/server_args.py`, the memory reservation values (e.g., `1024 * 18`, `1024 * 20`) are magic numbers. Defining them as named constants would improve code readability and maintainability.
- Attention Backend Initialization: In `python/sglang/srt/speculative/eagle_worker.py` (lines 696-699), when CUDA graph is not used for draft extend, `init_forward_metadata` appears to be called on the main attention backend of the draft model runner, while the subsequent forward pass uses `self.draft_extend_attn_backend` (via `forward_batch.attn_backend`). This could lead to inconsistencies. `init_forward_metadata` should likely be called on `self.draft_extend_attn_backend`.
Merge Readiness
The pull request introduces valuable performance enhancements with CUDA graph support for EAGLE draft extend. The core logic for the new CUDA graph runner and the integration into the EAGLE worker seem mostly correct.
However, there is one high-severity issue regarding attention backend initialization that should be addressed before merging to ensure correctness when CUDA graph is not used. Additionally, there's a medium-severity suggestion for improving maintainability by refactoring magic numbers related to memory reservation.
I am unable to approve this pull request. Please have another reviewer approve this code after addressing the identified issues.
```python
self.draft_model_runner.attn_backend.init_forward_metadata(forward_batch)
logits_output = self.draft_model_runner.model.forward(
    forward_batch.input_ids, forward_batch.positions, forward_batch
)
```
It seems `init_forward_metadata` is being called on `self.draft_model_runner.attn_backend` (which is the main attention backend of the draft model runner), but the forward pass (lines 697-699) will use `forward_batch.attn_backend`.

`forward_batch.attn_backend` is set to `self.draft_extend_attn_backend` (via `batch.get_model_worker_batch()`, which sets `attn_backend` appropriately for `DRAFT_EXTEND` mode).

Shouldn't `init_forward_metadata` be called on `self.draft_extend_attn_backend` to ensure consistency with the backend used in the subsequent `model.forward` call when CUDA graph is not used?
Suggested change:

```diff
-self.draft_model_runner.attn_backend.init_forward_metadata(forward_batch)
+self.draft_extend_attn_backend.init_forward_metadata(forward_batch)
 logits_output = self.draft_model_runner.model.forward(
     forward_batch.input_ids, forward_batch.positions, forward_batch
 )
```
```python
reserve_mem = 1024 * 18
# need to reserve more memory for spec cuda graph
if self.speculative_algorithm is not None:
    reserve_mem = 1024 * 20
```
The values `1024 * 18` and `1024 * 20` for `reserve_mem` are magic numbers. Could these be defined as named constants at the module level or within the class for better readability and maintainability? For example:
```python
_DEFAULT_RESERVE_MEM_MB = 18 * 1024
_SPEC_GRAPH_ADDITIONAL_RESERVE_MEM_MB = 2 * 1024  # Additional for speculative graph

# ... later in __post_init__
reserve_mem = _DEFAULT_RESERVE_MEM_MB
if self.speculative_algorithm is not None:
    reserve_mem += _SPEC_GRAPH_ADDITIONAL_RESERVE_MEM_MB
```
Or, keeping your current structure, just name the literals:
```python
_BASE_RESERVE_MEM_FOR_CUDA_GRAPH_MB = 18 * 1024
_SPEC_RESERVE_MEM_FOR_CUDA_GRAPH_MB = 20 * 1024

reserve_mem = _BASE_RESERVE_MEM_FOR_CUDA_GRAPH_MB
if self.speculative_algorithm is not None:
    reserve_mem = _SPEC_RESERVE_MEM_FOR_CUDA_GRAPH_MB
```
The accept rate issue should be fixed in 6707ea9.
The three parameters speculative-num-steps, speculative-eagle-topk, and speculative-num-draft-tokens have a significant impact on performance. In practice, is continuous parameter tuning the only way to select them?
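One common approach is a small offline sweep over the three flags, benchmarking each combination. A minimal sketch, assuming one server launch per configuration and sglang's `bench_serving` script; the candidate grid and prompt count are illustrative, and readiness polling is omitted:

```python
import itertools
import subprocess

# Illustrative candidate grid; a real sweep should also track the
# accept-length statistics the server reports per configuration.
steps_grid = [3, 5, 7]
topk_grid = [1, 4, 8]
draft_tokens_grid = [4, 8, 16]

for steps, topk, draft_tokens in itertools.product(steps_grid, topk_grid, draft_tokens_grid):
    # Launch a server with this speculative configuration.
    server = subprocess.Popen([
        "python3", "-m", "sglang.launch_server",
        "--model", "/dev/shm/DeepSeek-V3-0324", "--tp", "8",
        "--trust-remote-code", "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", str(steps),
        "--speculative-eagle-topk", str(topk),
        "--speculative-num-draft-tokens", str(draft_tokens),
    ])
    try:
        # Wait for the server to come up, then benchmark this configuration
        # (health-check / readiness polling omitted for brevity).
        subprocess.run(
            ["python3", "-m", "sglang.bench_serving",
             "--backend", "sglang", "--num-prompts", "200"],
            check=True,
        )
    finally:
        server.terminate()
        server.wait()
```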
Motivation
Add a draft extend CUDA graph for EAGLE. The FA3 backend is supported; other backends will be supported in follow-up PRs.
Co-authored-by: @kssteven418
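For readers unfamiliar with the pattern, the runner follows the usual capture/replay structure of the existing decode CUDA graph runners: copy inputs into fixed-address static buffers, capture one graph per batch size, then replay. A minimal sketch of that pattern using plain `torch.cuda.CUDAGraph`; the class, method names, and model signature here are illustrative, not the actual `EAGLEDraftExtendCudaGraphRunner` API:

```python
import torch


class DraftExtendGraphRunnerSketch:
    """Capture-once / replay-many CUDA graph pattern (illustrative sketch)."""

    def __init__(self, model, max_bs, hidden_size, device="cuda"):
        self.model = model
        # Static buffers: CUDA graphs require fixed tensor addresses,
        # so inputs are copied into these before every replay.
        self.input_ids = torch.zeros(max_bs, dtype=torch.int64, device=device)
        self.positions = torch.zeros(max_bs, dtype=torch.int64, device=device)
        self.out = torch.zeros(max_bs, hidden_size, device=device)
        self.graphs = {}  # one captured graph per batch size

    def capture(self, bs):
        g = torch.cuda.CUDAGraph()
        # Warm up on a side stream before capture (required by PyTorch).
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            self.model(self.input_ids[:bs], self.positions[:bs])
        torch.cuda.current_stream().wait_stream(s)
        # Record the forward pass into the graph, writing into the static output.
        with torch.cuda.graph(g):
            self.out[:bs] = self.model(self.input_ids[:bs], self.positions[:bs])
        self.graphs[bs] = g

    def replay(self, input_ids, positions):
        bs = input_ids.shape[0]
        if bs not in self.graphs:
            self.capture(bs)
        # Copy the real inputs into the static buffers, then replay.
        self.input_ids[:bs].copy_(input_ids)
        self.positions[:bs].copy_(positions)
        self.graphs[bs].replay()
        return self.out[:bs].clone()
```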
Benchmark
DSV3
11% `per_user_throughput` improvement for bs=1 and 5% for bs=32. Ref: #6606 (comment) and #6606 (comment).
main:
this PR:
Llama-3-8B
main:
This PR:
Profile
Llama-3-8B
main:

This PR:

DSV3
main:

This PR:
