A minimal, FlexAttention-based, vLLM-style engine for fast Gemma 2 inference.
This project has no flash-attn dependency and no custom Triton kernels; everything is implemented with FlexAttention. The code is commented and the structure is flat. Stay tuned for a more detailed blog post.
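To give a feel for the building block this project relies on, here is a rough sketch (not code from this repo): FlexAttention lets you describe an attention pattern with a small Python `mask_mod` function, and PyTorch compiles it into an efficient fused kernel, so no hand-written Triton or flash-attn is needed. Shapes, dtypes, and the causal mask below are illustrative assumptions.

```python
# Illustrative only -- shapes, dtypes, and the causal mask are assumptions,
# not this repo's actual code.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, S, D = 1, 8, 1024, 64  # batch, heads, sequence length, head dim

def causal(b, h, q_idx, kv_idx):
    # Each query may only attend to keys at or before its own position.
    return q_idx >= kv_idx

# B=None / H=None broadcast the mask over batch and heads.
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)
```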
```
flex-nano-vllm/
├── benchmark.py            # Testing and benchmarking script.
├── benchmark_vllm.py       # vLLM comparison benchmark (uses uv inline dependencies to run vLLM).
├── visualize.py            # Performance visualization script.
└── flex_nano_vllm/
    ├── inference.py        # Main inference engine; uses paged attention.
    ├── modeling_gemma2.py  # Gemma 2 model implementation, copied from transformers.
    └── paged_attention.py  # Paged attention implementation, including the page table and paged KV cache. Based on attention-gym.
```
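`paged_attention.py` builds on the page-table idea: a shared pool of fixed-size physical pages backs every sequence's KV cache, and a per-sequence page table maps logical token positions to physical pages. Below is a simplified, hypothetical sketch of that bookkeeping (names, sizes, and helper functions are made up for illustration; the real implementation also integrates the page table with FlexAttention, which this sketch omits).

```python
# Hypothetical sketch of a page table for a paged KV cache.
# Names, sizes, and shapes are illustrative, not this repo's API.
import torch

PAGE_SIZE = 16                 # tokens per physical page (assumed)
NUM_PAGES = 1024               # physical pages in the shared cache (assumed)
NUM_KV_HEADS, HEAD_DIM = 4, 256

# One shared physical KV cache; pages are handed out to sequences on demand.
k_cache = torch.zeros(NUM_PAGES, PAGE_SIZE, NUM_KV_HEADS, HEAD_DIM)
v_cache = torch.zeros_like(k_cache)

free_pages = list(range(NUM_PAGES))
page_table: dict[int, list[int]] = {}  # seq_id -> physical page ids, in logical order

def append_kv(seq_id: int, pos: int, k: torch.Tensor, v: torch.Tensor) -> None:
    """Write one token's K/V (shape [NUM_KV_HEADS, HEAD_DIM]) into the page
    backing logical position `pos` of sequence `seq_id`."""
    pages = page_table.setdefault(seq_id, [])
    if pos // PAGE_SIZE >= len(pages):  # this logical block has no page yet
        pages.append(free_pages.pop())  # allocate a physical page
    page = pages[pos // PAGE_SIZE]
    k_cache[page, pos % PAGE_SIZE] = k
    v_cache[page, pos % PAGE_SIZE] = v

def free_sequence(seq_id: int) -> None:
    """Return a finished sequence's pages to the free pool."""
    free_pages.extend(page_table.pop(seq_id, []))
```

During decoding, attention then reads K/V through the page table instead of a contiguous per-sequence cache, which is what lets many sequences share one fixed-size memory pool.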
```bash
uv sync

# run tests and benchmark
uv run benchmark.py

# compare with vLLM
uv run benchmark_vllm.py

# enable profiling to save additional metrics to a CSV file
# ENABLE_PROFILING=1 uv run benchmark_vllm.py
```
Test configuration:
- PyTorch version: 2.7.1+cu128
- GPU: 1× RTX 3090 (24 GB)
- Model: google/gemma-2-2b
- Workload: 512 requests, up to 512 input tokens, variable output length (128-512 tokens)
- Configs tested: vLLM at 50% and 90% GPU memory; flex-nano-vllm with the same page allocation as vLLM
| Implementation | Output Tokens/s | Requests/s | Total Throughput (tokens/s)* |
|---|---|---|---|
| vLLM v1, 90% GPU memory, high batch size† | 3,840 | 17.67 | 7,234 |
| vLLM v1, 90% GPU memory | 3,772 | 15.26 | 6,401 |
| flex-nano-vllm, 90% GPU memory, high batch size† | 3,440 | 14.30 | 5,817 |
| flex-nano-vllm, 90% GPU memory | 3,076 | 13.06 | 5,382 |
| vLLM v1, 50% GPU memory | 3,020 | 13.74 | 5,448 |
| flex-nano-vllm, 50% GPU memory | 2,313 | 9.96 | 4,068 |

\*Total throughput includes both input and output tokens.
†High batch size means `max_num_seqs=512` in vLLM (the maximum allowed concurrency).
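For reference, the vLLM rows above correspond roughly to settings like the following (a sketch only; the exact arguments used in the benchmark are in benchmark_vllm.py):

```python
# Rough equivalent of the "vLLM v1, 90% GPU memory, high batch size" row;
# see benchmark_vllm.py for the arguments actually used in the benchmark.
from vllm import LLM

llm = LLM(
    model="google/gemma-2-2b",
    gpu_memory_utilization=0.9,  # 90% GPU memory
    max_num_seqs=512,            # "high batch size" footnote above
)
```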
This project is licensed under the MIT License - see the LICENSE file for details.
Third-party code incorporated in this project retains its original licenses. See THIRD_PARTY_LICENSES.md for details.
- GeeeekExplorer/nano-vllm: This project is inspired by nano-vllm.
- pytorch-labs/attention-gym: The paged attention implementation is based on attention-gym.
- huggingface/transformers: I copied the Gemma 2 model from transformers and modified it to use FlexAttention / paged attention.
- vllm-project/vllm: vLLM supports a FlexAttention backend, which helped me find a useful flag for flex_attention.