FlashInfer: Kernel Library for LLM Serving
Updated Aug 29, 2025 · CUDA
Quantized attention achieves speedups of 2-5x over FlashAttention and 3-11x over xformers, without degrading end-to-end metrics across language, image, and video models.
SpargeAttention: a training-free sparse attention method that can accelerate inference for any model.
🤖FFPA: Extends FlashAttention-2 with Split-D, achieving ~O(1) SRAM complexity for large head dimensions; 1.8x~3x speedup🎉 vs SDPA EA.
Code for the paper "Cottention: Linear Transformers With Cosine Attention"
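Cosine attention replaces the softmax over query-key dot products with cosine similarity between L2-normalized queries and keys, which allows the classic linear-attention reordering of the matrix products. A minimal NumPy sketch of that idea (not the paper's actual implementation; the function name and shapes are illustrative assumptions):

```python
import numpy as np

def cosine_attention(q, k, v, eps=1e-6):
    # Illustrative sketch, not the Cottention codebase.
    # L2-normalize queries and keys so q @ k.T becomes cosine similarity.
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    # Without softmax, (Q K^T) V can be reassociated as Q (K^T V),
    # reducing cost from O(n^2 d) to O(n d^2) in sequence length n.
    kv = k.T @ v            # (d, d)
    return q @ kv           # (n, d)

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 4))   # n=8 tokens, d=4 dims
out = cosine_attention(q, k, v)
print(out.shape)  # (8, 4)
```

The reordering is exact here because no row-wise softmax sits between the two matrix products; that is what makes the method "linear".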
Patch-Based Stochastic Attention (an efficient attention mechanism)
A simple, unoptimized FlashAttention implementation based on the original paper.
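The core trick from the original FlashAttention paper is the online softmax: process K/V in tiles while carrying a running row-wise max, normalizer, and unnormalized output, so the full attention matrix is never materialized. A minimal NumPy sketch (a teaching aid under assumed 2-D shapes, not the repo's code; real kernels tile in SRAM and scale by 1/sqrt(d)):

```python
import numpy as np

def naive_attention(q, k, v):
    # Reference: materialize softmax(Q K^T) V in full (unscaled sketch).
    s = q @ k.T
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def flash_attention(q, k, v, block=4):
    # Tiled online softmax: per query row, keep running max m,
    # running normalizer l, and unnormalized output o.
    n, d = q.shape
    m = np.full(n, -np.inf)
    l = np.zeros(n)
    o = np.zeros((n, d))
    for j in range(0, k.shape[0], block):
        kj, vj = k[j:j+block], v[j:j+block]
        s = q @ kj.T                          # scores for this K/V tile
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)             # rescale old accumulators
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        o = o * scale[:, None] + p @ vj
        m = m_new
    return o / l[:, None]
```

Both functions produce the same output; the tiled version just never holds more than one `block`-wide slice of the score matrix at a time.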
A simple implementation of PagedAttention purely written in CUDA and C++.
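PagedAttention stores the KV cache in fixed-size physical blocks and uses a per-sequence block table to map logical token positions to blocks, so cache memory can be allocated on demand instead of reserved contiguously. A hypothetical Python sketch of that bookkeeping (all names, sizes, and helpers here are illustrative assumptions, not the repo's CUDA API):

```python
import numpy as np

# Hypothetical paged KV cache: NUM_BLOCKS physical blocks of BLOCK slots,
# each slot holding a D-dim key/value vector.
BLOCK, D, NUM_BLOCKS = 4, 8, 16
k_cache = np.zeros((NUM_BLOCKS, BLOCK, D))
v_cache = np.zeros((NUM_BLOCKS, BLOCK, D))

def append_kv(block_table, seq_len, k, v, free):
    # Allocate a fresh physical block whenever the last one is full.
    if seq_len % BLOCK == 0:
        block_table.append(free.pop())
    b, off = block_table[-1], seq_len % BLOCK
    k_cache[b, off], v_cache[b, off] = k, v
    return seq_len + 1

def gather_kv(block_table, seq_len):
    # Reassemble the logical K/V tensors from scattered physical blocks
    # (a real kernel would index blocks inside the attention loop instead).
    ks = np.concatenate([k_cache[b] for b in block_table])[:seq_len]
    vs = np.concatenate([v_cache[b] for b in block_table])[:seq_len]
    return ks, vs
```

Because only the small block table is per-sequence, sequences of very different lengths can share one physical pool with at most one partially filled block of waste each.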