```bash
cd DREAM
pip install -r requirements.txt
# Download the demo weights based on llava-v1.6-vicuna-7b
git clone https://huggingface.co/Alexhu1999/DREAM-llava-v1.6-vicuna-7b
```
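If you prefer not to use git, the same checkpoint can be fetched with the huggingface_hub client; a minimal sketch (requires `pip install huggingface_hub`):

```python
# Sketch: download the DREAM demo weights without git.
from huggingface_hub import snapshot_download

# Returns the local directory containing the checkpoint; pass it
# as --ea-model-path in the commands below.
local_dir = snapshot_download("Alexhu1999/DREAM-llava-v1.6-vicuna-7b")
print(local_dir)
```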
The inference code we provide automatically allocates model weights across multiple GPUs, allowing you to run models that exceed the memory of a single GPU.
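For reference, the usual Hugging Face / Accelerate pattern for this kind of sharded loading looks like the sketch below; DREAM's own loader may differ in its exact API.

```python
# Minimal sketch of multi-GPU weight allocation (assumes `accelerate`
# is installed); illustrative, not DREAM's actual loading code.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/home/apc/models/vicuna-7b-v1.3",  # base-model path from the example below
    device_map="auto",   # shard layers across all visible GPUs
    torch_dtype="auto",  # keep the checkpoint's native precision
)
```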
We provide a web interface, which you can launch with the following command. Once the model has fully loaded, a URL will be printed in the terminal; open it in your browser to use the interface.
```bash
python -m dream.application.webui \
    --ea-model-path /home/apc/models/DREAM-Vicuna-7B-v1.3 \
    --base-model-path /home/apc/models/vicuna-7b-v1.3 \
    --model-type vicuna \
    --total-token 8
```
--total-token sets the number of draft tokens. For smaller models and more capable GPUs, this value can be set larger; tuning it to your specific device and model yields better speedups. If set to -1, DREAM configures this parameter automatically.
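To make the trade-off concrete, the toy sketch below (illustrative only, not DREAM's implementation) shows one draft-then-verify round of speculative decoding: a larger draft budget raises the best-case number of tokens accepted per target-model pass, but wastes draft compute when the acceptance rate is low.

```python
# Toy draft-then-verify round; draft_next/target_next stand in for the
# draft and target models (each maps a token sequence to the next token).
def speculative_step(draft_next, target_next, context, total_token):
    # Draft phase: propose total_token tokens autoregressively (cheap).
    proposal = []
    for _ in range(total_token):
        proposal.append(draft_next(context + proposal))
    # Verify phase: a single target pass keeps the longest agreeing prefix.
    accepted = []
    for tok in proposal:
        if target_next(context + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # The target model always contributes one token, so progress is guaranteed.
    accepted.append(target_next(context + accepted))
    return context + accepted
```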
You can run the following command to generate the training data.
```bash
python -m dream.ge_data.allocation --outdir [path of data]
```
```bash
cd dream/model
deepspeed main_deepspeed.py \
    --deepspeed_config /home/apc/DREAM/dream/train/ds_config.json \
    --tmpdir /home/apc/Bingle/data/llava_vicuna_mmt_0/12_data/sharegpt_0_7999_mufp16 \
    --cpdir /home/apc/DREAM/dream/train/vicuna-7b-ckpt \
    --configpath /home/apc/DREAM/dream/train/vicuna_7B_config.json
```
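For reference, a DeepSpeed config along the lines of ds_config.json can be generated as below; the specific values are illustrative assumptions, not the settings shipped with DREAM.

```python
# Write an illustrative ds_config.json (values are assumptions).
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```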
You can test the speed of DREAM on MT-bench using the following command.
```bash
python -m dream.evaluation.eval_llava \
    --ea-model-path [path of DREAM weight] \
    --base-model-path [path of the original model]
```
Each evaluation run generates a .jsonl file that records the generation results and wall times.
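To compare runs, you can total the recorded wall times from two such files; the field name "wall_time" below is an assumption about the .jsonl schema, so adjust it to match the actual output.

```python
# Sketch: compute the speedup of DREAM over a baseline run from two
# result files (the "wall_time" field name is assumed).
import json

def total_wall_time(path):
    with open(path) as f:
        return sum(json.loads(line)["wall_time"] for line in f)

baseline = total_wall_time("baseline.jsonl")
dream = total_wall_time("dream.jsonl")
print(f"speedup: {baseline / dream:.2f}x")
```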
This project draws on many excellent projects in the LLM community, such as Medusa, EAGLE, and FastChat. We are releasing the LLaVA version first; support for other models will be merged soon.
If you find our work useful, please consider citing:
```bibtex
@misc{hu2025dreamdraftingrefinedtarget,
      title={DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding},
      author={Yunhai Hu and Tianhua Xia and Zining Liu and Rahul Raman and Xingyu Liu and Bo Bao and Eric Sather and Vithursan Thangarasa and Sai Qian Zhang},
      year={2025},
      eprint={2505.19201},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.19201},
}
```