
flex-nano-vllm

FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference.

Introduction

This project has no flash-attn dependency and no custom Triton kernels; everything is implemented with FlexAttention. The code is commented and the structure is flat. Stay tuned for a more detailed blog post.
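As a taste of the approach, here is a minimal, self-contained sketch (not code from this repo) of the two pieces that usually push Gemma 2 off the fast attention path: a causal `mask_mod` and the tanh logit soft-capping as a `score_mod`. The cap of 50.0 matches Gemma 2's `attn_logit_softcapping`; the shapes are toy values and a CUDA device is assumed.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 8, 256, 64  # toy shapes, illustration only
q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))

def causal(b, h, q_idx, kv_idx):
    # Standard causal masking, expressed as a FlexAttention mask_mod.
    return q_idx >= kv_idx

def softcap(score, b, h, q_idx, kv_idx):
    # Gemma 2 soft-caps attention logits with tanh (attn_logit_softcapping=50.0).
    # With FlexAttention this is a score_mod rather than a custom kernel.
    return 50.0 * torch.tanh(score / 50.0)

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out = flex_attention(q, k, v, score_mod=softcap, block_mask=block_mask)
```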

Code Structure

flex-nano-vllm/
├── benchmark.py                   # Testing and benchmarking script.
├── benchmark_vllm.py              # vLLM comparison benchmark (uses uv inline dependency to run vLLM).
├── visualize.py                   # Performance visualization script.
└── flex_nano_vllm/
    ├── inference.py               # Main inference engine, uses paged attention.
    ├── modeling_gemma2.py         # Gemma2 model implementation, copied from transformers.
    └── paged_attention.py         # Paged attention implementation, including page table and paged kv cache. Based on attention-gym.
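The paged-attention idea, following the attention-gym recipe, is to keep KV pages wherever they physically live and fold the page table into the attention mask. The sketch below illustrates that idea rather than the repo's actual code: a permuted page table, an inverse mapping (here called `physical_to_logical`, my name), and a `mask_mod` that recovers logical positions before applying the causal rule. An `allclose` check confirms the paged result matches ordinary causal attention.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 4, 128, 64  # toy shapes, illustration only
PAGE_SIZE = 16
NUM_PAGES = S // PAGE_SIZE
dev = "cuda"

# page_table[logical_page] -> physical page; physical_to_logical is its inverse.
page_table = torch.randperm(NUM_PAGES, device=dev)
physical_to_logical = torch.empty_like(page_table)
physical_to_logical[page_table] = torch.arange(NUM_PAGES, device=dev)

def paged_causal(b, h, q_idx, kv_idx):
    # kv_idx addresses the *physical* cache; recover the logical position
    # through the inverse page table, then apply the ordinary causal rule.
    logical_kv = physical_to_logical[kv_idx // PAGE_SIZE] * PAGE_SIZE + kv_idx % PAGE_SIZE
    return q_idx >= logical_kv

q = torch.randn(B, H, S, D, device=dev)
k_logical = torch.randn(B, H, S, D, device=dev)
v_logical = torch.randn(B, H, S, D, device=dev)

# Scatter contiguous K/V into paged (physical) order.
perm = (page_table[:, None] * PAGE_SIZE + torch.arange(PAGE_SIZE, device=dev)).flatten()
k_paged = torch.empty_like(k_logical); k_paged[..., perm, :] = k_logical
v_paged = torch.empty_like(v_logical); v_paged[..., perm, :] = v_logical

paged_mask = create_block_mask(paged_causal, None, None, S, S, device=dev)
causal_mask = create_block_mask(lambda b, h, qi, ki: qi >= ki, None, None, S, S, device=dev)

out_paged = flex_attention(q, k_paged, v_paged, block_mask=paged_mask)
out_plain = flex_attention(q, k_logical, v_logical, block_mask=causal_mask)
assert torch.allclose(out_paged, out_plain, atol=1e-4)  # same result, scattered pages
```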

Quick Start

uv sync

# run test and benchmark
uv run benchmark.py

# compare with vllm
uv run benchmark_vllm.py

# enable profiling to save more metrics to a csv file
# ENABLE_PROFILING=1 uv run benchmark_vllm.py

Results

Test configuration:

  • PyTorch version: 2.7.1+cu128
  • GPU: RTX 3090 x 1 (24GB)
  • Model: google/gemma-2-2b
  • Workload: 512 requests, max 512 input tokens, variable output tokens (128-512)
  • Configs tested: vLLM at 50% and 90% GPU memory; flex-nano-vllm with the same page allocation as vLLM
| Implementation | Output Tokens/s | Requests/s | Total Throughput* |
|---|---|---|---|
| vLLM v1, 90% GPU memory, high batch size† | 3,840 | 17.67 | 7,234 |
| vLLM v1, 90% GPU memory | 3,772 | 15.26 | 6,401 |
| flex-nano-vllm, 90% GPU memory, high batch size† | 3,440 | 14.30 | 5,817 |
| flex-nano-vllm, 90% GPU memory | 3,076 | 13.06 | 5,382 |
| vLLM v1, 50% GPU memory | 3,020 | 13.74 | 5,448 |
| flex-nano-vllm, 50% GPU memory | 2,313 | 9.96 | 4,068 |

*Total throughput includes both input and output tokens
† High batch size means max_num_seqs=512 in vLLM (the maximum allowed concurrency)
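For reference, here is roughly how the benchmarked configurations map onto vLLM's engine arguments; gpu_memory_utilization and max_num_seqs are standard vLLM options, but see benchmark_vllm.py for the exact invocation used here.

```python
from vllm import LLM

# "vLLM v1, 90% GPU memory, high batch size†" expressed as engine arguments.
llm = LLM(
    model="google/gemma-2-2b",
    gpu_memory_utilization=0.9,  # the 50% rows use 0.5
    max_num_seqs=512,            # † allow up to 512 concurrent sequences
)
```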

Performance Comparison

License

This project is licensed under the MIT License; see the LICENSE file for details.

Third-party code incorporated in this project retains its original licenses. See THIRD_PARTY_LICENSES.md for details.

Acknowledgments
