
Conversation

@ispobock (Collaborator) commented Sep 1, 2024

Motivation

Previously, fp8 KV cache for Flashinfer was supported in #1204.
This PR adds support for the same feature in the Triton runtime.
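
For intuition, here is a minimal sketch of the idea, assuming PyTorch >= 2.1 (for torch.float8_e5m2) and a CUDA device; this is illustrative only, not the PR's actual Triton kernel code:

```python
# Illustrative sketch only, not the PR's actual Triton kernel code.
# Assumes PyTorch >= 2.1 (torch.float8_e5m2) and a CUDA device.
import torch

num_tokens, num_heads, head_dim = 4096, 8, 128

# KV cache buffers allocated in fp8_e5m2 instead of fp16/bf16 (half the memory).
k_cache = torch.empty(num_tokens, num_heads, head_dim,
                      dtype=torch.float8_e5m2, device="cuda")
v_cache = torch.empty_like(k_cache)

def write_kv(pos: int, k: torch.Tensor, v: torch.Tensor) -> None:
    """Downcast fresh fp16/bf16 K/V to fp8_e5m2 when appending to the cache."""
    k_cache[pos].copy_(k.to(torch.float8_e5m2))
    v_cache[pos].copy_(v.to(torch.float8_e5m2))

def read_kv(length: int, compute_dtype=torch.float16):
    """A mixed-precision kernel upcasts on read; a native fp8 kernel would
    instead feed the fp8 values straight to fp8 tensor cores."""
    return k_cache[:length].to(compute_dtype), v_cache[:length].to(compute_dtype)
```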

Evaluation

DeepSeek-Coder-V2-Lite-Instruct

| backend | kv cache dtype | gsm8k flexible-extract | gsm8k strict-match |
|---|---|---|---|
| Flashinfer | bf16 | 0.7703 | 0.7627 |
| Flashinfer | fp8_e5m2 | 0.7635 | 0.7491 |
| Triton | fp16 | 0.7839 | 0.7695 |
| Triton | fp8_e5m2 | 0.7627 | 0.7498 |

Meta-Llama-3.1-8B-Instruct

| backend | kv cache dtype | gsm8k flexible-extract | gsm8k strict-match |
|---|---|---|---|
| Flashinfer | bf16 | 0.7771 | 0.7127 |
| Flashinfer | fp8_e5m2 | 0.7650 | 0.7043 |
| Triton | fp16 | 0.7817 | 0.7202 |
| Triton | fp8_e5m2 | 0.7710 | 0.7498 |
Reproduction commands (DeepSeek-Coder-V2-Lite-Instruct shown):

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --port 30000 --trust-remote-code --host 0.0.0.0 --tp=1 --enable-mla --kv-cache-dtype fp8_e5m2

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,base_url=http://0.0.0.0:30000/v1/completions,num_concurrent=128,max_retries=3,tokenized_requests=False
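
Before running lm_eval, a quick sanity check against the OpenAI-compatible endpoint started above can help; the endpoint path comes from the lm_eval base_url, while the prompt and sampling parameters below are arbitrary:

```python
# Quick sanity check of the server launched above; the endpoint path comes
# from the lm_eval base_url, while the prompt/parameters are arbitrary.
import requests

resp = requests.post(
    "http://0.0.0.0:30000/v1/completions",
    json={
        "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
        "prompt": "Q: What is 7 * 8?\nA:",
        "max_tokens": 16,
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["text"])
```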

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs added the enhancement label Sep 1, 2024
@zhyncs (Member) commented Sep 1, 2024

Nice work!

@zhyncs (Member) commented Sep 1, 2024

ref #1156

@zhyncs (Member) commented Sep 1, 2024

hold on, we should also verify on H100

@zhyncs merged commit 6cb32ef into sgl-project:main Sep 1, 2024 (1 of 8 checks passed)
@yzh119 (Collaborator) commented Sep 2, 2024

Note: the current fp8 prefill kernels in flashinfer are mixed-precision kernels (they upcast the fp8 K/V to fp16 inside the kernel and use fp16 tensor cores), which is slow because of the online dequantization overhead, and it's different from the Triton kernel logic here (which downcasts the fp16 query to fp8 and uses fp8 tensor cores).

The native fp8 prefill attention kernel in flashinfer will land together with fa3, where we use fp8 tensor cores, but the accuracy landscape might change (and might align with your Triton implementation).
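
A rough emulation of the two dataflows, for intuition only: PyTorch's `@` operator does not accept fp8 inputs, so the math below runs in fp16, whereas real kernels would use fp16 or fp8 tensor cores respectively.

```python
# Sketch of the two fp8 attention dataflows described above (numerics emulation
# only; real kernels use fp16 or fp8 tensor cores for the GEMMs).
import torch

def mixed_precision_attn(q_fp16, k_fp8, v_fp8):
    # Current flashinfer fp8 prefill: K/V are upcast to fp16 inside the kernel,
    # the query stays fp16, and fp16 tensor cores do the GEMMs. The online
    # dequantization is what makes this path slower.
    k, v = k_fp8.to(torch.float16), v_fp8.to(torch.float16)
    scores = torch.softmax(q_fp16 @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def native_fp8_attn(q_fp16, k_fp8, v_fp8):
    # Triton kernel in this PR (and a future fa3-based flashinfer kernel): the
    # query is also downcast to fp8, so fp8 tensor cores can be used directly.
    # Emulated here by round-tripping q through fp8 before an fp16 matmul.
    q = q_fp16.to(torch.float8_e5m2).to(torch.float16)
    k, v = k_fp8.to(torch.float16), v_fp8.to(torch.float16)
    scores = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
    return scores @ v
```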

@zhyncs (Member) commented Sep 2, 2024


Looking forward to it very much!
