@ispobock (Collaborator) commented Jan 31, 2025

Motivation

Fix torch.compile for the block-wise FP8 linear layer.

Baseline (without torch compile):

python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8

Prefill. latency: 1.82421 s, throughput:     70.17 token/s
Decode.  latency: 1.75760 s, throughput:      0.57 token/s
Decode.  latency: 0.02740 s, throughput:     36.49 token/s
Decode.  latency: 0.02703 s, throughput:     37.00 token/s
Decode.  latency: 0.02711 s, throughput:     36.88 token/s
Decode.  latency: 0.02711 s, throughput:     36.89 token/s
Decode.  median latency: 0.02711 s, median throughput:     36.88 token/s
Total. latency:  3.745 s, throughput:     36.32 token/s
Benchmark ...
Prefill. latency: 0.16921 s, throughput:    756.48 token/s
Decode.  latency: 0.02716 s, throughput:     36.81 token/s
Decode.  latency: 0.02713 s, throughput:     36.86 token/s
Decode.  latency: 0.02713 s, throughput:     36.86 token/s
Decode.  latency: 0.02714 s, throughput:     36.85 token/s
Decode.  latency: 0.02716 s, throughput:     36.82 token/s
Decode.  median latency: 0.02719 s, median throughput:     36.78 token/s
Total. latency:  7.107 s, throughput:     54.03 token/s


With --enable-torch-compile:

python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --enable-torch-compile --torch-compile-max-bs 1
Prefill. latency: 1.85489 s, throughput:     69.01 token/s
Decode.  latency: 0.34461 s, throughput:      2.90 token/s
Decode.  latency: 0.02107 s, throughput:     47.46 token/s
Decode.  latency: 0.02078 s, throughput:     48.13 token/s
Decode.  latency: 0.02073 s, throughput:     48.23 token/s
Decode.  latency: 0.02075 s, throughput:     48.20 token/s
Decode.  median latency: 0.02078 s, median throughput:     48.13 token/s
Total. latency:  2.325 s, throughput:     58.50 token/s
Benchmark ...
Prefill. latency: 0.17728 s, throughput:    722.03 token/s
Decode.  latency: 0.02077 s, throughput:     48.15 token/s
Decode.  latency: 0.02075 s, throughput:     48.19 token/s
Decode.  latency: 0.02075 s, throughput:     48.19 token/s
Decode.  latency: 0.02075 s, throughput:     48.19 token/s
Decode.  latency: 0.02074 s, throughput:     48.22 token/s
Decode.  median latency: 0.02092 s, median throughput:     47.81 token/s
Total. latency:  5.497 s, throughput:     69.86 token/s
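
As a rough reading of the two runs above, the sketch below computes the relative decode improvement from the reported medians of the second ("Benchmark ...") pass of each run. The figures are copied from the logs; the variable names and the `relative_change` helper are illustrative only and are not part of sglang.

```python
# Median decode figures copied from the second ("Benchmark ...") pass of each run above.
baseline = {"latency_s": 0.02719, "throughput_tok_s": 36.78}  # without --enable-torch-compile
compiled = {"latency_s": 0.02092, "throughput_tok_s": 47.81}  # with --enable-torch-compile


def relative_change(before: float, after: float) -> float:
    """Return (after - before) / before as a fraction (hypothetical helper)."""
    return (after - before) / before


latency_change = relative_change(baseline["latency_s"], compiled["latency_s"])
throughput_change = relative_change(
    baseline["throughput_tok_s"], compiled["throughput_tok_s"]
)
speedup = baseline["latency_s"] / compiled["latency_s"]

print(f"decode latency change:    {latency_change:+.1%}")
print(f"decode throughput change: {throughput_change:+.1%}")
print(f"decode speedup:           {speedup:.2f}x")
```

On these numbers, median decode latency drops by roughly 23% and median decode throughput rises by roughly 30% at batch size 1. Prefill latency is essentially unchanged between the two runs.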

@zhyncs zhyncs merged commit c02e313 into sgl-project:main Jan 31, 2025
1 of 14 checks passed
@ispobock (Collaborator, Author) commented

Accuracy:

python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 1

Accuracy: 0.950
Invalid: 0.000
Latency: 452.187 s
Output throughput: 43.175 token/s

@zhyncs zhyncs mentioned this pull request Jan 31, 2025
@Jasmine-up commented

I can't replicate your result of 50 tokens/s. Could you please tell me which machine you are using?

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
@zhyncs zhyncs mentioned this pull request Mar 10, 2025