Support Triton fp8 e5m2 kv cache #1286
Conversation
Nice work!
ref #1156
hold on, we should also verify on H100
Note: the current fp8 prefill kernel in flashinfer is a mixed-precision kernel (it upcasts fp8 K/V to fp16 inside the kernel and uses fp16 tensor cores), which is slow because of the online dequantization overhead, and it differs from the Triton kernel logic here (where the fp16 query is downcast to fp8 and fp8 tensor cores are used). The native fp8 prefill attention kernel in flashinfer will land together with FA3, where we use fp8 tensor cores, but the accuracy landscape might change (and might align with your Triton implementation).
Looking forward to it very much!
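For intuition, here is a rough, standalone numerics sketch (plain PyTorch, not an actual kernel) of the mixed-precision path described above: K/V are stored in fp8 e5m2 and upcast before the matmuls, and the output is compared against a full-precision reference. Tensor names and shapes are illustrative; real kernels would perform the upcast (or the query downcast) inside the attention kernel and use fp16 or fp8 tensor cores rather than the float32 matmuls used here.

```python
# Numerics sketch only: shows the effect of storing K/V in fp8 e5m2 and
# upcasting before use (the "mixed-precision" path). Requires torch >= 2.1
# for torch.float8_e5m2; shapes and names are illustrative assumptions.
import torch

torch.manual_seed(0)
seq_len, head_dim = 64, 128

q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)
v = torch.randn(seq_len, head_dim)


def attention(q, k, v):
    # Plain scaled-dot-product attention; float32 matmuls stand in for the
    # fp16/fp8 tensor-core matmuls a real kernel would use.
    scores = (q @ k.T) * head_dim**-0.5
    return torch.softmax(scores, dim=-1) @ v


# Reference: full-precision K/V.
out_ref = attention(q, k, v)

# Mixed-precision path: K/V cache stored as fp8 e5m2, upcast before the matmuls.
k_fp8 = k.to(torch.float8_e5m2).float()
v_fp8 = v.to(torch.float8_e5m2).float()
out_fp8 = attention(q, k_fp8, v_fp8)

print("max abs diff from fp8 e5m2 K/V storage:",
      (out_ref - out_fp8).abs().max().item())
```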
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Motivation
Previously, fp8 kv cache for Flashinfer was supported in #1204. This PR adds support for this feature in the Triton runtime.
Evaluation
DeepSeek-Coder-V2-Lite-Instruct
Meta-Llama-3.1-8B-Instruct
Checklist