Jodi: Unification of Visual Generation and Understanding via Joint Modeling
Yifeng Xu1,2, Zhenliang He1, Meina Kan1,2, Shiguang Shan1,2, Xilin Chen1,2
1State Key Lab of AI Safety, Institute of Computing Technology, CAS, China
2University of Chinese Academy of Sciences, China
We introduce Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Jodi is built upon a linear diffusion transformer with a role switch mechanism, enabling joint generation, controllable generation, and image perception in a unified diffusion model.
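For intuition, here is a minimal, hypothetical sketch of the role switch idea (an illustration, not the released implementation): each domain is assigned a role per step, and the diffusion loss is applied only to domains being generated. The `add_noise` argument stands in for the actual noise scheduler.

```python
# Hypothetical sketch of the role switch (illustration only, not Jodi's code).
import torch

ROLE_GENERATE, ROLE_CONDITION, ROLE_DROP = 0, 1, 2

def apply_roles(latents, roles, noise, add_noise, t):
    """latents/roles/noise: dicts keyed by domain name (image, edge, depth, ...).
    add_noise: placeholder for the diffusion forward process at timestep t."""
    model_inputs, loss_targets = {}, {}
    for name, z in latents.items():
        if roles[name] == ROLE_GENERATE:       # noised; the model must denoise it
            model_inputs[name] = add_noise(z, noise[name], t)
            loss_targets[name] = noise[name]   # loss is applied only here
        elif roles[name] == ROLE_CONDITION:    # kept clean, used as conditioning
            model_inputs[name] = z
        else:                                  # ROLE_DROP: domain excluded
            model_inputs[name] = torch.zeros_like(z)
    return model_inputs, loss_targets

# Joint generation: all domains ROLE_GENERATE.
# Controllable generation: image ROLE_GENERATE, chosen labels ROLE_CONDITION.
# Perception: image ROLE_CONDITION, labels ROLE_GENERATE.
```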
- [2025-06-17]: The training code and Joint-1.6M dataset are released.
- [2025-05-27]: The arXiv paper, model weights, and inference code are released.
The code is tested with Python 3.10.0, PyTorch 2.4.0, and CUDA 12.1.
Clone this repo:
git clone https://github.com/VIPL-GENUN/Jodi.git
cd Jodi
Create and activate a new conda environment:
conda create -n jodi python=3.10.0 -y
conda activate jodi
Install dependencies:
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu121
pip install xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
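As a quick sanity check after installation, you can verify the pinned versions and that CUDA is visible:

```python
# Verify the pinned versions and CUDA availability.
import torch
import torchvision

print(torch.__version__)          # expected: 2.4.0+cu121
print(torchvision.__version__)    # expected: 0.19.0+cu121
print(torch.cuda.is_available())  # should print True on a CUDA 12.1 machine
```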
We provide our model on HuggingFace. The model will be automatically downloaded when you launch the Gradio demo, or you can download it manually using the following command:
huggingface-cli download VIPL-GENUN/Jodi
Launch the Gradio demo:
python app/jodi_gradio.py --model_path hf://VIPL-GENUN/Jodi/Jodi.pth
We provide the Joint-1.6M dataset on HuggingFace. To help you get started quickly, we also provide a small example dataset in assets/example_data with the same file structure as the Joint-1.6M dataset.
assets/example_data
├── metadata.jsonl
├── image
│ ├── 0adbfa3cab59b674b83f24a7964ae23f.jpg
│ ├── 0aded2a84831be7b912ef85f6c1eb6e2.jpg
│ └── 0adf204564879c270bafba334ca99e3c.jpg
├── annotation_edge
│ └── (same as images)
├── annotation_depth
│ └── (same as images)
└── ...
The code loads the data based on metadata.jsonl. Each line of metadata.jsonl is a dictionary containing paths to an image and its annotations (labels), the image height and width, and captions from different models. For example:
{
"image": "image/0adbfa3cab59b674b83f24a7964ae23f.jpg",
"info": {"height": 1280, "width": 1024},
"caption": {"Qwen2-VL-7b-Instruct": "xxxxxxxx", "BLIP2-OPT-2.7b": "yyy"},
"annotation_edge": "annotation_edge/0adbfa3cab59b674b83f24a7964ae23f.jpg",
"annotation_depth": "annotation_depth/0adbfa3cab59b674b83f24a7964ae23f.jpg",
# ...
}
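For illustration, here is a minimal sketch of how one might iterate over the dataset given this format (this is not the project's data loader; the dataset root and the caption key are assumptions based on the example above):

```python
# Minimal sketch: iterate over a dataset in the Joint-1.6M format.
import json
from pathlib import Path
from PIL import Image

root = Path("assets/example_data")  # or the Joint-1.6M root

with open(root / "metadata.jsonl") as f:
    for line in f:
        record = json.loads(line)
        image = Image.open(root / record["image"])
        caption = record["caption"]["Qwen2-VL-7b-Instruct"]
        # every key starting with "annotation_" points to a label map
        labels = {k: Image.open(root / v)
                  for k, v in record.items() if k.startswith("annotation_")}
```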
Jodi is built on top of Sana. You can either fine-tune Jodi or train from the Sana weights directly. Download the corresponding checkpoint first:
# download Jodi
huggingface-cli download VIPL-GENUN/Jodi
# download Sana
huggingface-cli download Efficient-Large-Model/Sana_1600M_1024px_BF16
Fine-tune Jodi on the example dataset:
bash scripts/train_from_jodi.sh ./configs/train_example_data.yaml
Train from Sana on the Joint-1.6M dataset:
bash scripts/train_from_sana.sh ./configs/train_joint1.6m.yaml
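To train on your own data, you can write a metadata.jsonl in the same format. Below is a hedged sketch that generates one from a directory laid out like assets/example_data; the directory name my_dataset and the caption key my_captioner are placeholders, not names used by this repo:

```python
# Sketch: build a metadata.jsonl for custom data in the Joint-1.6M format.
import json
from pathlib import Path
from PIL import Image

root = Path("my_dataset")  # placeholder: your dataset root
with open(root / "metadata.jsonl", "w") as f:
    for img_path in sorted((root / "image").glob("*.jpg")):
        w, h = Image.open(img_path).size  # PIL returns (width, height)
        record = {
            "image": f"image/{img_path.name}",
            "info": {"height": h, "width": w},
            "caption": {"my_captioner": "a caption for this image"},
            "annotation_edge": f"annotation_edge/{img_path.name}",
            "annotation_depth": f"annotation_depth/{img_path.name}",
        }
        f.write(json.dumps(record) + "\n")
```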
This project is built upon Sana. Thanks for their great work!
If you find this project helpful, please consider citing:
@article{xu2025jodi,
title={Jodi: Unification of Visual Generation and Understanding via Joint Modeling},
author={Xu, Yifeng and He, Zhenliang and Kan, Meina and Shan, Shiguang and Chen, Xilin},
journal={arXiv preprint arXiv:2505.19084},
year={2025}
}