
Conversation

@ispobock (Collaborator) commented Sep 1, 2024

Motivation

Previously, fp8 KV cache for Flashinfer was supported in #1204.
This PR adds support for the same feature in the Triton runtime.
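
For intuition, here is a minimal sketch of the idea, assuming PyTorch >= 2.1 (for torch.float8_e5m2) and a CUDA device; this is illustrative only, not the PR's actual Triton kernel code:

```python
# Illustrative sketch only, not the PR's actual Triton kernel code.
# Assumes PyTorch >= 2.1 (torch.float8_e5m2) and a CUDA device.
import torch

num_tokens, num_heads, head_dim = 4096, 8, 128

# KV cache buffers allocated in fp8_e5m2 instead of fp16/bf16 (half the memory).
k_cache = torch.empty(num_tokens, num_heads, head_dim,
                      dtype=torch.float8_e5m2, device="cuda")
v_cache = torch.empty_like(k_cache)

def write_kv(pos: int, k: torch.Tensor, v: torch.Tensor) -> None:
    """Downcast fresh fp16/bf16 K/V to fp8_e5m2 when appending to the cache."""
    k_cache[pos].copy_(k.to(torch.float8_e5m2))
    v_cache[pos].copy_(v.to(torch.float8_e5m2))

def read_kv(length: int, compute_dtype=torch.float16):
    """A mixed-precision kernel upcasts on read; a native fp8 kernel would
    instead feed the fp8 values straight to fp8 tensor cores."""
    return k_cache[:length].to(compute_dtype), v_cache[:length].to(compute_dtype)
```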

Evaluation

DeepSeek-Coder-V2-Lite-Instruct

| backend | kv cache dtype | gsm8k flexible-extract | gsm8k strict-match |
|---|---|---|---|
| Flashinfer | bf16 | 0.7703 | 0.7627 |
| Flashinfer | fp8_e5m2 | 0.7635 | 0.7491 |
| Triton | fp16 | 0.7839 | 0.7695 |
| Triton | fp8_e5m2 | 0.7627 | 0.7498 |

Meta-Llama-3.1-8B-Instruct

| backend | kv cache dtype | gsm8k flexible-extract | gsm8k strict-match |
|---|---|---|---|
| Flashinfer | bf16 | 0.7771 | 0.7127 |
| Flashinfer | fp8_e5m2 | 0.7650 | 0.7043 |
| Triton | fp16 | 0.7817 | 0.7202 |
| Triton | fp8_e5m2 | 0.7710 | 0.7498 |
Reproduction commands (DeepSeek-Coder-V2-Lite-Instruct shown):

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --port 30000 --trust-remote-code --host 0.0.0.0 --tp=1 --enable-mla --kv-cache-dtype fp8_e5m2

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,base_url=http://0.0.0.0:30000/v1/completions,num_concurrent=128,max_retries=3,tokenized_requests=False
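
Before running lm_eval, a quick sanity check against the OpenAI-compatible endpoint started above can help; the endpoint path comes from the lm_eval base_url, while the prompt and sampling parameters below are arbitrary:

```python
# Quick sanity check of the server launched above; the endpoint path comes
# from the lm_eval base_url, while the prompt/parameters are arbitrary.
import requests

resp = requests.post(
    "http://0.0.0.0:30000/v1/completions",
    json={
        "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
        "prompt": "Q: What is 7 * 8?\nA:",
        "max_tokens": 16,
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["text"])
```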

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs added the enhancement label Sep 1, 2024
@zhyncs (Member) commented Sep 1, 2024

Nice work!

@zhyncs (Member) commented Sep 1, 2024

ref #1156

@zhyncs (Member) commented Sep 1, 2024

hold on, we should also verify on H100

@zhyncs merged commit 6cb32ef into sgl-project:main Sep 1, 2024 (1 of 8 checks passed)
@yzh119 (Collaborator) commented Sep 2, 2024

Note: the current fp8 prefill kernels in flashinfer are mixed-precision kernels (they upcast the fp8 K/V to fp16 inside the kernel and use fp16 tensor cores), which is slow because of the online dequantization overhead, and it's different from the Triton kernel logic here (which downcasts the fp16 query to fp8 and uses fp8 tensor cores).

The native fp8 prefill attention kernel in flashinfer will land together with fa3, where we use fp8 tensor cores, but the accuracy landscape might change (and might align with your Triton implementation).
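
A rough emulation of the two dataflows, for intuition only: PyTorch's `@` operator does not accept fp8 inputs, so the math below runs in fp16, whereas real kernels would use fp16 or fp8 tensor cores respectively.

```python
# Sketch of the two fp8 attention dataflows described above (numerics emulation
# only; real kernels use fp16 or fp8 tensor cores for the GEMMs).
import torch

def mixed_precision_attn(q_fp16, k_fp8, v_fp8):
    # Current flashinfer fp8 prefill: K/V are upcast to fp16 inside the kernel,
    # the query stays fp16, and fp16 tensor cores do the GEMMs. The online
    # dequantization is what makes this path slower.
    k, v = k_fp8.to(torch.float16), v_fp8.to(torch.float16)
    scores = torch.softmax(q_fp16 @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def native_fp8_attn(q_fp16, k_fp8, v_fp8):
    # Triton kernel in this PR (and a future fa3-based flashinfer kernel): the
    # query is also downcast to fp8, so fp8 tensor cores can be used directly.
    # Emulated here by round-tripping q through fp8 before an fp16 matmul.
    q = q_fp16.to(torch.float8_e5m2).to(torch.float16)
    k, v = k_fp8.to(torch.float16), v_fp8.to(torch.float16)
    scores = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v
```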

@zhyncs (Member) commented Sep 2, 2024


Looking forward to it very much!
