HunyuanVideo-Avatar 🌅

HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

🔥🔥🔥 News!!

Jun 06, 2025: 🔥 HunyuanVideo-Avatar supports Single GPU with only 10GB VRAM, with TeaCache included, HUGE THANKS to Wan2GP
May 28, 2025: 🔥 HunyuanVideo-Avatar is available in Cloud-Native-Build (CNB) HunyuanVideo-Avatar.
May 28, 2025: 👋 We release the inference code and model weights of HunyuanVideo-Avatar. Download.

📑 Open-source Plan

HunyuanVideo-Avatar
- Inference
- Checkpoints
- ComfyUI

Abstract

Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) A character image injection module is designed to replace the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. This ensures the dynamic motion and strong character consistency; (ii) An Audio Emotion Module (AEM) is introduced to extract and transfer the emotional cues from an emotion reference image to the target generated video, enabling fine-grained and accurate emotion style control; (iii) A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with latent-level face mask, enabling independent audio injection via cross-attention for multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios. The source code and model weights will be released publicly.

HunyuanVideo-Avatar Overall Architecture

We propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of generating dynamic, emotion-controllable, and multi-character dialogue videos.

🎉 HunyuanVideo-Avatar Key Features

High-Dynamic and Emotion-Controllable Video Generation

HunyuanVideo-Avatar animates any input avatar image into a high-dynamic, emotion-controllable video using simple audio conditions. It accepts multi-style avatar images at arbitrary scales and resolutions, covering photorealistic, cartoon, 3D-rendered, and anthropomorphic characters, and supports multi-scale generation spanning portrait, upper-body, and full-body framing. The generated videos have highly dynamic foregrounds and backgrounds, with superior realism and naturalness. In addition, the system can control the characters' facial emotions, conditioned on the input audio.

Various Applications

HunyuanVideo-Avatar supports various downstream tasks and applications. For instance, it generates talking avatar videos that can be applied to e-commerce, online streaming, social media video production, and similar settings. In addition, its multi-character animation feature broadens its applications to areas such as video content creation and editing.

📜 Requirements

An NVIDIA GPU with CUDA support is required.
- The model is tested on a machine with 8 GPUs.
- Minimum: 24GB of GPU memory for 704px × 768px × 129 frames, but generation is very slow.
- Recommended: A GPU with 96GB of memory, for better generation quality.
- Tips: If an out-of-memory (OOM) error occurs on a GPU with 80GB of memory, try reducing the image resolution.
Tested operating system: Linux
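
As a rough pre-flight check, the memory figures above can be turned into a small helper. This is only a sketch and not part of the repository: the function name `suggest_mode`, its output labels, and the cut-offs (taken from the 10GB, 24GB, and 96GB figures quoted in this README) are assumptions, and real headroom depends on resolution and frame count.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): map total VRAM in GB to a rough tier.
# Thresholds mirror figures quoted in this README; they are not measured limits.
suggest_mode() {
    vram_gb=$1
    if [ "$vram_gb" -ge 96 ]; then
        echo "recommended"
    elif [ "$vram_gb" -ge 24 ]; then
        echo "supported"      # meets the stated 24GB minimum; may be slow
    elif [ "$vram_gb" -ge 10 ]; then
        echo "wan2gp"         # below 24GB; see the Wan2GP section
    else
        echo "insufficient"
    fi
}

# Query the first GPU if nvidia-smi is available; the value is reported in MiB.
if command -v nvidia-smi >/dev/null 2>&1; then
    total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
    case "$total_mib" in
        ''|*[!0-9]*) echo "could not read GPU memory" ;;
        # Round to the nearest GB so a 24GB card reporting ~24564 MiB counts as 24.
        *) suggest_mode $(( (total_mib + 512) / 1024 )) ;;
    esac
fi
```

For example, `suggest_mode 80` prints `supported`, which matches the tip above: an 80GB card can run the model but may need a lower image resolution.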

🛠️ Dependencies and Installation

Begin by cloning the repository:

git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar.git
cd HunyuanVideo-Avatar

Installation Guide for Linux

We recommend CUDA versions 12.4 or 11.8 for the manual installation.

Conda's installation instructions are available here.
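
Before installing, it may help to confirm which CUDA toolkit is visible on the machine. The sketch below is not part of the repository; the helper names `is_recommended_cuda` and `parse_cuda_release` are hypothetical, and it assumes `nvcc` reports its version in the usual "release X.Y" form.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): succeed only for the CUDA
# versions this guide recommends for manual installation.
is_recommended_cuda() {
    case "$1" in
        11.8|12.4) return 0 ;;
        *) return 1 ;;
    esac
}

# Extract "major.minor" from text such as
# "Cuda compilation tools, release 12.4, V12.4.131" read on stdin.
parse_cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Only run the check when nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    detected=$(nvcc --version | parse_cuda_release)
    if is_recommended_cuda "$detected"; then
        echo "CUDA $detected matches a recommended version"
    else
        echo "CUDA ${detected:-unknown} is not one of the recommended versions (11.8, 12.4)"
    fi
fi
```

Other CUDA versions may still work; the check only flags a mismatch with the two versions named above.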

# 1. Create conda environment
conda create -n HunyuanVideo-Avatar python==3.10.9

# 2. Activate the environment
conda activate HunyuanVideo-Avatar

# 3. Install PyTorch and other dependencies using conda
# For CUDA 11.8
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# 4. Install pip dependencies
python -m pip install -r requirements.txt
# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

If you run into a floating point exception (core dump) on a specific GPU type, you may try the following solutions:

# Option 1: Make sure you have installed CUDA 12.4, CUBLAS>=12.4.5.8, and CUDNN>=9.00 (or simply use our CUDA 12 docker image).
pip install nvidia-cublas-cu12==12.4.5.8
# The path below assumes a Python 3.8 install under /opt/conda; adjust it to your environment's site-packages.
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/

# Option 2: Force the CUDA 11.8 compiled version of PyTorch and all the other packages
pip uninstall -r requirements.txt  # uninstall all packages
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install ninja
pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

Alternatively, you can use the HunyuanVideo Docker image. Use the following commands to pull and run it.

# For CUDA 12.4 (updated to avoid the floating point exception)
docker pull hunyuanvideo/hunyuanvideo:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

# For CUDA 11.8
docker pull hunyuanvideo/hunyuanvideo:cuda_11
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

🧱 Download Pretrained Models

Details on downloading the pretrained models are shown here.

🚀 Parallel Inference on Multiple GPUs

For example, to generate a video with 8 GPUs, you can use the following command:

cd HunyuanVideo-Avatar

JOBS_DIR=$(dirname $(dirname "$0"))
export PYTHONPATH=./
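export MODEL_BASE="./weights"
# Output directory for generated videos; any writable path works (./results is an example).
OUTPUT_BASEPATH=./results
checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt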

torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --input 'assets/test.csv' \
    --ckpt ${checkpoint_path} \
    --sample-n-frames 129 \
    --seed 128 \
    --image-size 704 \
    --cfg-scale 7.5 \
    --infer-steps 50 \
    --use-deepcache 1 \
    --flow-shift-eval-video 5.0 \
    --save-path ${OUTPUT_BASEPATH}
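
The command above hard-codes `--nproc_per_node=8`. A minimal sketch for checking how many GPUs are actually visible before launching is shown below. The helper name `count_gpus` is hypothetical and not part of the repository, and this README only documents the 8-GPU case, so whether other GPU counts work is not stated here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): count GPUs from `nvidia-smi -L`
# style output on stdin, where each device is one line starting with "GPU ".
count_gpus() {
    # grep -c prints 0 but exits non-zero when nothing matches; keep the exit status clean.
    grep -c '^GPU ' || true
}

# Only query the driver when nvidia-smi is on PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    NUM_GPUS=$(nvidia-smi -L | count_gpus)
    echo "Detected ${NUM_GPUS} GPU(s)"
fi
```

If the detected count is lower than 8, the value passed to `--nproc_per_node` would need to be lowered to match, assuming the script supports that configuration.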

🔑 Single-GPU Inference

For example, to generate a video with 1 GPU, you can use the following command:

cd HunyuanVideo-Avatar

JOBS_DIR=$(dirname $(dirname "$0"))
export PYTHONPATH=./

export MODEL_BASE=./weights
OUTPUT_BASEPATH=./results-single
checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt

export DISABLE_SP=1 
CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \
    --input 'assets/test.csv' \
    --ckpt ${checkpoint_path} \
    --sample-n-frames 129 \
    --seed 128 \
    --image-size 704 \
    --cfg-scale 7.5 \
    --infer-steps 50 \
    --use-deepcache 1 \
    --flow-shift-eval-video 5.0 \
    --save-path ${OUTPUT_BASEPATH} \
    --use-fp8 \
    --infer-min

Run with very low VRAM

cd HunyuanVideo-Avatar

JOBS_DIR=$(dirname $(dirname "$0"))
export PYTHONPATH=./

export MODEL_BASE=./weights
OUTPUT_BASEPATH=./results-poor

checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt

export CPU_OFFLOAD=1
CUDA_VISIBLE_DEVICES=0 python3 hymm_sp/sample_gpu_poor.py \
    --input 'assets/test.csv' \
    --ckpt ${checkpoint_path} \
    --sample-n-frames 129 \
    --seed 128 \
    --image-size 704 \
    --cfg-scale 7.5 \
    --infer-steps 50 \
    --use-deepcache 1 \
    --flow-shift-eval-video 5.0 \
    --save-path ${OUTPUT_BASEPATH} \
    --use-fp8 \
    --cpu-offload \
    --infer-min
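
Both single-GPU commands expect the FP8 checkpoint and the input CSV to be in place. The sketch below checks for them before a long run starts. The helper name `require_files` is hypothetical and not part of the repository; the paths are the ones used in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): report each missing file on
# stderr and return non-zero if any path given as an argument does not exist.
require_files() {
    missing=0
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Same paths as in the single-GPU commands; MODEL_BASE defaults to ./weights.
MODEL_BASE=${MODEL_BASE:-./weights}
checkpoint_path=${MODEL_BASE}/ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states_fp8.pt

if require_files "$checkpoint_path" assets/test.csv; then
    echo "inputs found; ready to launch"
else
    echo "fix the paths above before launching" >&2
fi
```

Run it from the repository root so the relative paths resolve the same way as in the inference commands.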

Run with 10GB VRAM GPU (TeaCache supported)

Thanks to Wan2GP, HunyuanVideo-Avatar now supports single GPU mode with even lower VRAM (10GB) without quality degradation. Check out this great repo.

Run a Gradio Server

cd HunyuanVideo-Avatar

bash ./scripts/run_gradio.sh

🔗 BibTeX

If you find HunyuanVideo-Avatar useful for your research and applications, please cite using this BibTeX:

@misc{hu2025HunyuanVideo-Avatar,
      title={HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters}, 
      author={Yi Chen and Sen Liang and Zixiang Zhou and Ziyao Huang and Yifeng Ma and Junshu Tang and Qin Lin and Yuan Zhou and Qinglin Lu},
      year={2025},
      eprint={2505.20156},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/pdf/2505.20156}, 
}

Acknowledgements

We would like to thank the contributors to the HunyuanVideo, SD3, FLUX, Llama, LLaVA, Xtuner, diffusers and HuggingFace repositories, for their open research and exploration.
