Unofficial Implementation of EAGLE Speculative Decoding.
Read our launch announcement: https://frugalgpu.substack.com/p/introducing-baldeagle
Read our guide on how to train your own EAGLE model: https://frugalgpu.substack.com/p/how-to-train-your-own-eagle-speculative
- Clean model implementation on top of HuggingFace Transformers that can be replicated for all models
  - Abstracts away attention, causal mask, etc.
- Training loop implemented using the HuggingFace Trainer for more readable and modular code
  - Easily modify the learning rate scheduler
  - Abstracts away gradient accumulation, autocasting, checkpointing, logging, resuming, etc.
- Improved data generation scripts that modularize data formatting, tokenization, and loss mask generation
  - Easy to switch to other datasets and tokenizers
  - UltraChat and ShareGPT implementations already included
- `view_data.py` script that shows the loss mask on the original text for validation purposes (a minimal sketch of the loss-mask idea follows below)
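For orientation, here is a minimal sketch of how a chat-template loss mask can be built so that only assistant tokens contribute to the loss. This is an illustration of the idea, not the repo's exact implementation, and the tokenizer name is just an example:

```python
# Illustration only: build a 0/1 loss mask over a chat-formatted conversation
# so prompt tokens are masked out and assistant tokens are trained on.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # any chat-tuned tokenizer
messages = [
    {"role": "user", "content": "What is speculative decoding?"},
    {"role": "assistant", "content": "It drafts tokens with a small model and verifies them with the target model."},
]

# Token ids for the full conversation and for the prompt-only prefix.
full_ids = tok.apply_chat_template(messages, tokenize=True)
prompt_ids = tok.apply_chat_template(messages[:-1], tokenize=True, add_generation_prompt=True)

# 0 = ignored (prompt), 1 = contributes to the loss (assistant response).
loss_mask = [0] * len(prompt_ids) + [1] * (len(full_ids) - len(prompt_ids))
assert len(loss_mask) == len(full_ids)
```

`view_data.py` serves exactly this kind of sanity check: decoding the tokens and highlighting which ones the mask keeps.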
| Target Model | BaldEagle Model |
|---|---|
| Llama-3.1-8B-Instruct | BaldEagle-Llama-3.1-8B-Instruct |
| Qwen-2.5-7B-Instruct | BaldEagle-Qwen-2.5-7B-Instruct |
Note: the generated data requires a significant amount of disk space, since we save a `sequence_length x hidden_dim` hidden-state tensor for each sample. ShareGPT (68k rows) requires ~650 GB and UltraChat (200k rows) requires ~2 TB.
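A rough back-of-envelope check of those numbers (the average sequence length here is an assumption, not a measured value):

```python
# Rough storage estimate for ShareGPT features: each sample stores a
# (sequence_length, hidden_dim) tensor of 2-byte floats.
bytes_per_value = 2        # fp16 / bf16
hidden_dim = 4096          # Llama-3.1-8B hidden size
avg_seq_len = 1200         # assumed average tokens per conversation
rows = 68_000              # ShareGPT

est_gb = rows * avg_seq_len * hidden_dim * bytes_per_value / 1e9
print(f"~{est_gb:.0f} GB")  # ~668 GB, consistent with the ~650 GB figure above
```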
- Edit `generate_data.py` for the dataset and model you are using.
  - Section 1 is focused on the dataset and reformatting it if necessary; by default we use UltraChat, and ShareGPT is available in the commented blocks.
  - Section 2 tokenizes and generates the loss mask based on the tokenizer's chat template.
- In `allocation.py`, set the GPUs you want to use for data generation.
  - This will split the data and call `generate_data.py` on separate slices on different GPUs (a rough sketch of this pattern follows these steps).
  - Modify the `outdir` variable.
- Call `allocation.py` while specifying the output directory with `--outdir`.
  - i.e. `python allocation.py --outdir {output_directory}`
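The splitting step roughly follows the pattern below. Flag names and the slicing interface are assumptions for illustration, so check `allocation.py` and `generate_data.py` for the actual arguments:

```python
# Sketch: split the dataset into contiguous slices and run generate_data.py
# on one GPU per slice. Flag names here are illustrative, not the real CLI.
import os
import subprocess

gpus = [0, 1, 2, 3]                   # GPUs to use for data generation
total_rows = 68_000                   # e.g. ShareGPT
per_gpu = total_rows // len(gpus)
outdir = "/data/sharegpt_features"    # same value as the outdir variable

procs = []
for i, gpu in enumerate(gpus):
    start, end = i * per_gpu, (i + 1) * per_gpu
    procs.append(subprocess.Popen(
        ["python", "generate_data.py",
         "--start", str(start), "--end", str(end), "--outdir", outdir],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)},
    ))
for p in procs:
    p.wait()
```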
- In `train.py`, modify the necessary variables (see the sketch below for typical values).
  - Specify `path` to a local path for the main model you're training for.
  - Modify the data paths in the `Load data` section to match your data paths from the previous section.
  - Modify any trainer parameters.
- Launch the training script on 1 GPU with `python3 train.py`.
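Because the training loop sits on top of the HuggingFace Trainer, most knobs live in `TrainingArguments`. The variable names and values below are illustrative assumptions, not the script's exact contents:

```python
# Illustrative example of the kind of variables to adjust in train.py.
from transformers import TrainingArguments

path = "/models/Llama-3.1-8B-Instruct"       # local path to the target model
train_data_path = "/data/sharegpt_features"  # outdir from the data generation step

training_args = TrainingArguments(
    output_dir="checkpoints/baldeagle-llama-3.1-8b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # handled by the Trainer
    learning_rate=1e-4,
    lr_scheduler_type="cosine",          # easy to swap schedulers
    num_train_epochs=2,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)
```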
Currently, the training-time test from the EAGLE-3 paper is being worked on in the `train/train_eagle_ttt.py` and `train/modules/trainer/trainer_eagle_ttt.py` files.
EAGLE-2 + training-time test model: https://huggingface.co/NickL77/BaldEagle-TTT-Llama-3.1-8B-Instruct-alpha
- 11.7% faster and an 8.4% higher acceptance rate than the EAGLE-2 baseline
Fused features require new data generation, and EAGLE-3 trains on target-model generations rather than on a fixed dataset, as EAGLE-1 does. Fused features will require:
- [Experimental] new data generation to extract high, medium, and low features (a rough sketch of multi-layer feature extraction follows this list)
  - This will require 3x more storage.
  - Currently, `generate_data_fused_features.py` can generate low, mid, and high features.
    - This is based on the EAGLE repo's layer selection.
- Faster data generation, since target-model generation will be required.
  - Ideally we can use a faster inference server like vLLM or SGLang rather than HuggingFace.
- Modifications to the model and trainer code for feature fusion.
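As a starting point, multi-layer features can be pulled from the target model with `output_hidden_states=True`. The specific layer indices below are assumptions for illustration; the actual selection follows `generate_data_fused_features.py` and the EAGLE repo's layer choices:

```python
# Sketch: extract low / mid / high hidden states from the target model and
# concatenate them into a fused feature (hence roughly 3x the storage).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"   # example target model
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("Hello, world!", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states has num_layers + 1 entries; index 0 is the embedding output.
n = len(out.hidden_states)
low, mid, high = out.hidden_states[2], out.hidden_states[n // 2], out.hidden_states[-3]
fused = torch.cat([low, mid, high], dim=-1)   # (batch, seq_len, 3 * hidden_dim)
```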
Feel free to open an issue to discuss implementation and results!
If you found this project useful, please cite it as:
Liu, N. (2025). BaldEagle (Version 1.0.0) [Computer software]. https://github.com/NickL77/BaldEagle/
or
```bibtex
@software{Liu_BaldEagle_2025,
  title = {BaldEagle},
  author = {Liu, Nicholas},
  year = {2025},
  month = {May},
  url = {https://github.com/NickL77/BaldEagle/},
  license = {MIT},
  version = {1.0.0}
}
```