
HAMburger: Accelerating LLM Inference via Token Smashing

arXiv Paper | License: MIT

Introduction

HAMburger is a hierarchically auto-regressive model that can output multiple tokens per forward pass. Our approach reduces the growth of KV cache computation from linear to sub-linear w.r.t. the generation length and achieves a proportional TPS speedup. On both standard tasks and long-context tasks, HAMburger achieves up to 2x higher TPS and 2x less KV cache computation (and storage) while matching or even surpassing the base model.
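
Intuitively, if HAMburger emits k tokens per forward pass of the base model on average, only 1/k as many KV cache entries are computed and stored for the same output length, which is where the up-to-2x TPS and KV savings come from.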

(HAMburger-1B on the left and Llama-3.2-1B-Instruct on the right)

Architecture

HAMburger wraps a standard LLM (e.g., Llama-3.2-1B-Instruct) with a relative-position-aware compositional embedder in front, which smashes the multiple tokens produced in the last step into a single input, and a micro-step decoder behind, which outputs tokens at constant FLOPs.
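
To make the data flow concrete, below is a minimal sketch of one macro-step, assuming hypothetical module names and call signatures; they are for exposition only and are not the repository's actual API.

import torch
import torch.nn as nn

class HAMburgerSketch(nn.Module):
    """Illustrative macro-step flow; module names and signatures are assumptions."""

    def __init__(self, embedder: nn.Module, base_llm: nn.Module, micro_decoder: nn.Module):
        super().__init__()
        self.embedder = embedder            # relative-position-aware compositional embedder
        self.base_llm = base_llm            # standard LLM, e.g. Llama-3.2-1B-Instruct
        self.micro_decoder = micro_decoder  # emits tokens at constant FLOPs

    @torch.no_grad()
    def macro_step(self, last_tokens: torch.Tensor, kv_cache):
        # 1) Smash the tokens produced in the previous step into ONE input embedding.
        fused = self.embedder(last_tokens)                 # [batch, 1, hidden]
        # 2) Run the base LLM once; only one new KV cache entry is created,
        #    no matter how many tokens were fused.
        hidden, kv_cache = self.base_llm(fused, kv_cache)  # [batch, 1, hidden]
        # 3) The micro-step decoder emits one or more tokens from that hidden state.
        next_tokens = self.micro_decoder(hidden)
        return next_tokens, kv_cache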

Environment Setup

Use conda as below:

conda create -yn hamburger python=3.10.15
conda activate hamburger
pip3 install -r requirements.txt

Download our 1B model here, and change the path to point to where you downloaded it.
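
As one possible way to fetch the checkpoint (the repo id below is a placeholder; substitute the actual model linked above):

from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with the HAMburger-1B model linked above.
local_path = snapshot_download(
    repo_id="<org>/<hamburger-1b>",
    local_dir="./checkpoints/hamburger-1b",
)
print(local_path)  # point the demo/eval scripts at this directory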

Training

HAMburger is instruction-finetuned on publicly available datasets, and we provide both the training code and our trained checkpoints for full reproduction.

Here's the overall training flow:

Data Preparation

We provide scripts that process the data automatically, and you can easily extend them by adding new datasets:

bash data_scripts/process.sh
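
The exact extension point depends on the scripts under data_scripts/, but conceptually adding a dataset means mapping it into the instruction/response format the pipeline expects. A hypothetical sketch (the dataset name, field names, and output schema are assumptions):

import json
from datasets import load_dataset

def process_new_dataset(output_path: str = "data/new_dataset.jsonl") -> None:
    """Hypothetical example: convert an extra dataset into instruction/response pairs."""
    ds = load_dataset("tatsu-lab/alpaca", split="train")  # any public instruction dataset
    with open(output_path, "w") as f:
        for row in ds:
            # Field names here are assumptions; match whatever schema process.sh produces.
            record = {"instruction": row["instruction"], "response": row["output"]}
            f.write(json.dumps(record) + "\n")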

Start Training

We trained our 1B model on 8x H100 GPUs. To reproduce our results, we suggest running:

python3 -m hamburger.train

Inference

We implement HAMburger on top of both GPT-Fast and HuggingFace for a balance of simplicity and performance.

Generation Demo

To run the streaming demo, simply run the following and choose an option based on the guidance:

python generate.py

To run the GPT-Fast version, please read the guidance here.

Evaluate Results

All evaluation-related files are stored in ./eval.

Long Bench: to run LongBench, simply run:

cd ./eval/long_bench
bash eval_long_bench.sh
python summarize_results.py # this is optional

To run standard tasks, we rely on lm_eval and evalplus:

  1. First, apply some patches to lm_eval by copying (overwriting) ./eval/standard/lm_eval_patch/* into the lm_eval/tasks directory of your conda environment.
  2. Then, set up a server that exposes a common API (a hedged request example follows this list):
bash ./eval/launch_server.sh hf # for baseline
bash ./eval/launch_server.sh hamburger 0.8 # for hamburger
  3. Run the commands found in ./eval/standard/client.sh for each individual task.
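
If the server exposes an OpenAI-style completions endpoint (an assumption; the exact host, port, and route are defined by ./eval/launch_server.sh and ./eval/standard/client.sh), a single request might look like:

import requests

# Assumes an OpenAI-style completions endpoint on localhost:8000; adjust to match
# whatever ./eval/launch_server.sh actually serves.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "hamburger", "prompt": "Hello, world!", "max_tokens": 64},
    timeout=60,
)
print(resp.json())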

FAQ

  1. How is HAMburger different from other multi-token LLMs such as Byte Latent Transformer, MegaByte, and BlockTransformer?
  • HAMburger outputs at the granularity of tokens instead of bytes, making it more efficient at inference than byte-based methods.
  • HAMburger can dynamically decide how many tokens to generate at each macro-step instead of fixing a patch size, which lets it fully utilize the model's knowledge and squeeze the most out of each KV cache entry.
  • HAMburger is standalone and does not require serving a separate model for segmentation during inference.
  • HAMburger is the only one among these that is instruction fine-tuned, well evaluated on downstream chat tasks, and ready to be used in many applications.
  2. How does HAMburger compare to speculative decoding?

    HAMburger can be regarded as a special case of self-speculative decoding with several favorable features:

  • All tokens drafted by the micro-step decoder in the last macro-step are accepted; instead of drafting tokens that would be rejected, the micro-step decoder simply stops generating.
  • A single forward pass (and hence a single KV cache entry) is required for verification, regardless of the number of tokens produced in the last step.
  • No alignment of a draft model is needed, as the training itself implicitly performs this process, similar to MTP. This also greatly simplifies serving.
  • A tunable parameter called the confidence level is introduced to trade off quality and speedup (see the sketch after this list).
  • Finally, we want to emphasize that HAMburger is extremely batch-friendly due to highly parallelizable grafted modules, which makes it suitable for high-throughput applications where speculative decoding can become less effective.
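
To make the confidence-level trade-off concrete, here is a minimal, hypothetical sketch of how a micro-step decoder might stop early; the function names, interfaces, and default values are assumptions, not the repository's implementation.

import torch

def micro_step_decode(micro_decoder, hidden, confidence_level: float = 0.8, max_tokens: int = 4):
    """Hypothetical micro-step loop: keep emitting tokens while confidence stays high."""
    tokens, state = [], hidden
    for _ in range(max_tokens):
        logits, state = micro_decoder(state)   # assumed interface
        probs = torch.softmax(logits, dim=-1)
        conf, tok = probs.max(dim=-1)
        tokens.append(tok.item())
        # A higher confidence_level stops earlier: fewer tokens per macro-step,
        # higher quality, smaller speedup; a lower value does the opposite.
        if conf.item() < confidence_level:
            break
    return tokens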

Citation

If you find our work useful and interesting, please consider citing:

@misc{liu2025hamburgeracceleratingllminference,
      title={HAMburger: Accelerating LLM Inference via Token Smashing}, 
      author={Jingyu Liu and Ce Zhang},
      year={2025},
      eprint={2505.20438},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20438}, 
}
