Support EAGLE draft extend CUDA graph #6606

Conversation
From the results, performance is better than main at batch sizes 1 and 4, but worse at batch sizes 16 and 32.
```python
from sglang.srt.speculative.eagle_worker import EAGLEWorker


class EAGLEDraftExtendCudaGraphRunner:
```
BTW, when enabling the draft extend CUDA graph, we should also adjust mem_fraction_static to avoid running out of memory. For example, with speculative settings 5/4/8 (num-steps / eagle-topk / num-draft-tokens):
```shell
python3 -m sglang.launch_server --model /dev/shm/DeepSeek-V3-0324 --tp 8 --trust-remote-code \
    --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 \
    --speculative-algorithm EAGLE

lm_eval --model local-chat-completions \
    --model_args model=/dev/shm/DeepSeek-V3-0324,base_url=http://127.0.0.1:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
    --tasks gsm8k --batch_size 128 --apply_chat_template --num_fewshot 8
```
This configuration will OOM.
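A minimal sketch of the workaround, assuming the standard `--mem-fraction-static` launch flag; the value 0.7 is illustrative, not a tuned recommendation:

```shell
# Lower the static memory fraction to leave headroom for the extra
# draft extend CUDA graphs (0.7 is an illustrative value, not tuned).
python3 -m sglang.launch_server --model /dev/shm/DeepSeek-V3-0324 --tp 8 --trust-remote-code \
    --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 \
    --speculative-algorithm EAGLE --mem-fraction-static 0.7
```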
Performance improved after 9055a49.
There is an accept-rate issue at large batch sizes. It should be fixed before merging.
/gemini review
Code Review
This pull request introduces CUDA graph support for EAGLE draft extend, which is a significant step towards improving performance for speculative decoding. The changes look promising, with performance gains demonstrated in the PR description.
I've identified a couple of areas for improvement, primarily related to maintainability and a potential correctness issue in attention backend initialization. Overall, the new `EAGLEDraftExtendCudaGraphRunner` class seems well-structured for its purpose.
Summary of Findings
- Magic Numbers for Memory Reservation: In `python/sglang/srt/server_args.py`, the memory reservation values (e.g., `1024 * 18`, `1024 * 20`) are magic numbers. Defining them as named constants would improve code readability and maintainability.
- Attention Backend Initialization: In `python/sglang/srt/speculative/eagle_worker.py` (lines 696-699), when CUDA graph is not used for draft extend, `init_forward_metadata` appears to be called on the main attention backend of the draft model runner, while the subsequent forward pass uses `self.draft_extend_attn_backend` (via `forward_batch.attn_backend`). This could lead to inconsistencies. `init_forward_metadata` should likely be called on `self.draft_extend_attn_backend`.
Merge Readiness
The pull request introduces valuable performance enhancements with CUDA graph support for EAGLE draft extend. The core logic for the new CUDA graph runner and the integration into the EAGLE worker seem mostly correct.
However, there is one high-severity issue regarding attention backend initialization that should be addressed before merging to ensure correctness when CUDA graph is not used. Additionally, there's a medium-severity suggestion for improving maintainability by refactoring magic numbers related to memory reservation.
I am unable to approve this pull request. Please have another reviewer approve this code after addressing the identified issues.
```python
self.draft_model_runner.attn_backend.init_forward_metadata(forward_batch)
logits_output = self.draft_model_runner.model.forward(
    forward_batch.input_ids, forward_batch.positions, forward_batch
)
```
It seems `init_forward_metadata` is being called on `self.draft_model_runner.attn_backend` (which is the main attention backend of the draft model runner), but the forward pass (lines 697-699) will use `forward_batch.attn_backend`.

`forward_batch.attn_backend` is set to `self.draft_extend_attn_backend` (via `batch.get_model_worker_batch()`, which sets `attn_backend` appropriately for `DRAFT_EXTEND` mode).

Shouldn't `init_forward_metadata` be called on `self.draft_extend_attn_backend` to ensure consistency with the backend used in the subsequent `model.forward` call when CUDA graph is not used?
Suggested change:

```diff
-self.draft_model_runner.attn_backend.init_forward_metadata(forward_batch)
+self.draft_extend_attn_backend.init_forward_metadata(forward_batch)
 logits_output = self.draft_model_runner.model.forward(
     forward_batch.input_ids, forward_batch.positions, forward_batch
 )
```
```python
reserve_mem = 1024 * 18
# need to reserve more memory for spec cuda graph
if self.speculative_algorithm is not None:
    reserve_mem = 1024 * 20
```
The values `1024 * 18` and `1024 * 20` for `reserve_mem` are magic numbers. Could these be defined as named constants at the module level or within the class for better readability and maintainability? For example:
```python
_DEFAULT_RESERVE_MEM_MB = 18 * 1024
_SPEC_GRAPH_ADDITIONAL_RESERVE_MEM_MB = 2 * 1024  # Additional for speculative graph

# ... later in __post_init__
reserve_mem = _DEFAULT_RESERVE_MEM_MB
if self.speculative_algorithm is not None:
    reserve_mem += _SPEC_GRAPH_ADDITIONAL_RESERVE_MEM_MB
```
Or, keeping your current structure, just name the literals:
```python
_BASE_RESERVE_MEM_FOR_CUDA_GRAPH_MB = 18 * 1024
_SPEC_RESERVE_MEM_FOR_CUDA_GRAPH_MB = 20 * 1024

reserve_mem = _BASE_RESERVE_MEM_FOR_CUDA_GRAPH_MB
if self.speculative_algorithm is not None:
    reserve_mem = _SPEC_RESERVE_MEM_FOR_CUDA_GRAPH_MB
```
The accept rate issue should be fixed in 6707ea9.
The three parameters speculative-num-steps, speculative-eagle-topk, and speculative-num-draft-tokens have a significant impact on performance. In practice, is continuous parameter tuning the only way to select them?
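One common approach is a small offline sweep over the three flags, benchmarking each combination. A minimal sketch, assuming one server launch per configuration and sglang's `bench_serving` script; the candidate grid and prompt count are illustrative, and readiness polling is omitted:

```python
import itertools
import subprocess

# Illustrative candidate grid; a real sweep should also track the
# accept-length statistics the server reports per configuration.
steps_grid = [3, 5, 7]
topk_grid = [1, 4, 8]
draft_tokens_grid = [4, 8, 16]

for steps, topk, draft_tokens in itertools.product(steps_grid, topk_grid, draft_tokens_grid):
    # Launch a server with this speculative configuration.
    server = subprocess.Popen([
        "python3", "-m", "sglang.launch_server",
        "--model", "/dev/shm/DeepSeek-V3-0324", "--tp", "8",
        "--trust-remote-code", "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", str(steps),
        "--speculative-eagle-topk", str(topk),
        "--speculative-num-draft-tokens", str(draft_tokens),
    ])
    try:
        # Wait for the server to come up, then benchmark this configuration
        # (health-check / readiness polling omitted for brevity).
        subprocess.run(
            ["python3", "-m", "sglang.bench_serving",
             "--backend", "sglang", "--num-prompts", "200"],
            check=True,
        )
    finally:
        server.terminate()
        server.wait()
```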
Motivation
Add a draft extend CUDA graph for EAGLE. The FA3 backend is supported; other backends will be supported in follow-up PRs.
Co-authored-by: @kssteven418
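For readers unfamiliar with the pattern, the runner follows the usual capture/replay structure of the existing decode CUDA graph runners: copy inputs into fixed-address static buffers, capture one graph per batch size, then replay. A minimal sketch of that pattern using plain `torch.cuda.CUDAGraph`; the class, method names, and model signature here are illustrative, not the actual `EAGLEDraftExtendCudaGraphRunner` API:

```python
import torch


class DraftExtendGraphRunnerSketch:
    """Capture-once / replay-many CUDA graph pattern (illustrative sketch)."""

    def __init__(self, model, max_bs, hidden_size, device="cuda"):
        self.model = model
        # Static buffers: CUDA graphs require fixed tensor addresses,
        # so inputs are copied into these before every replay.
        self.input_ids = torch.zeros(max_bs, dtype=torch.int64, device=device)
        self.positions = torch.zeros(max_bs, dtype=torch.int64, device=device)
        self.out = torch.zeros(max_bs, hidden_size, device=device)
        self.graphs = {}  # one captured graph per batch size

    def capture(self, bs):
        g = torch.cuda.CUDAGraph()
        # Warm up on a side stream before capture (required by PyTorch).
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            self.model(self.input_ids[:bs], self.positions[:bs])
        torch.cuda.current_stream().wait_stream(s)
        # Record the forward pass into the graph, writing into the static output.
        with torch.cuda.graph(g):
            self.out[:bs] = self.model(self.input_ids[:bs], self.positions[:bs])
        self.graphs[bs] = g

    def replay(self, input_ids, positions):
        bs = input_ids.shape[0]
        if bs not in self.graphs:
            self.capture(bs)
        # Copy the real inputs into the static buffers, then replay.
        self.input_ids[:bs].copy_(input_ids)
        self.positions[:bs].copy_(positions)
        self.graphs[bs].replay()
        return self.out[:bs].clone()
```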
Benchmark
DSV3
11% `per_user_throughput` improvement for bs=1 and 5% for bs=32. Ref: #6606 (comment) and #6606 (comment).
main:
this PR:
Llama-3-8B
main:
This PR:
Profile
Llama-3-8B
main:

This PR:

DSV3
main:

This PR:
