Jodi: Unification of Visual Generation and Understanding via Joint Modeling
Yifeng Xu1,2, Zhenliang He1, Meina Kan1,2, Shiguang Shan1,2, Xilin Chen1,2
1State Key Lab of AI Safety, Institute of Computing Technology, CAS, China
2University of Chinese Academy of Sciences, China
We introduce Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Jodi is built upon a linear diffusion transformer with a role switch mechanism, enabling joint generation, controllable generation, and image perception in a unified diffusion model.
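For intuition, here is a minimal, hypothetical sketch of the role switch idea (an illustration, not the released implementation): each domain is assigned a role per step, and the diffusion loss is applied only to domains being generated. The `add_noise` argument stands in for the actual noise scheduler.

```python
# Hypothetical sketch of the role switch (illustration only, not Jodi's code).
import torch

ROLE_GENERATE, ROLE_CONDITION, ROLE_DROP = 0, 1, 2

def apply_roles(latents, roles, noise, add_noise, t):
    """latents/roles/noise: dicts keyed by domain name (image, edge, depth, ...).
    add_noise: placeholder for the diffusion forward process at timestep t."""
    model_inputs, loss_targets = {}, {}
    for name, z in latents.items():
        if roles[name] == ROLE_GENERATE:       # noised; the model must denoise it
            model_inputs[name] = add_noise(z, noise[name], t)
            loss_targets[name] = noise[name]   # loss is applied only here
        elif roles[name] == ROLE_CONDITION:    # kept clean, used as conditioning
            model_inputs[name] = z
        else:                                  # ROLE_DROP: domain excluded
            model_inputs[name] = torch.zeros_like(z)
    return model_inputs, loss_targets

# Joint generation: all domains ROLE_GENERATE.
# Controllable generation: image ROLE_GENERATE, chosen labels ROLE_CONDITION.
# Perception: image ROLE_CONDITION, labels ROLE_GENERATE.
```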
- [2025-06-17]: The training code and Joint-1.6M dataset are released.
- [2025-05-27]: The arXiv paper, model weights, and inference code are released.
The code is tested with Python 3.10.0, PyTorch 2.4.0, and CUDA 12.1.
Clone this repo:
git clone https://github.com/VIPL-GENUN/Jodi.git
cd Jodi
Create and activate a new conda environment:
conda create -n jodi python=3.10.0 -y
conda activate jodi
Install dependencies:
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu121
pip install xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
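As a quick sanity check after installation, you can verify the pinned versions and that CUDA is visible:

```python
# Verify the pinned versions and CUDA availability.
import torch
import torchvision

print(torch.__version__)          # expected: 2.4.0+cu121
print(torchvision.__version__)    # expected: 0.19.0+cu121
print(torch.cuda.is_available())  # should print True on a CUDA 12.1 machine
```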
We provide our model on HuggingFace. The model will be automatically downloaded when you launch the Gradio demo, or you can download it manually using the following command:
huggingface-cli download VIPL-GENUN/Jodi
Launch the Gradio demo:
python app/jodi_gradio.py --model_path hf://VIPL-GENUN/Jodi/Jodi.pth
We provide the Joint-1.6M dataset on HuggingFace. To help you get started quickly, we also provide a small example dataset in assets/example_data with the same file structure as the Joint-1.6M dataset.
assets/example_data
├── metadata.jsonl
├── image
│ ├── 0adbfa3cab59b674b83f24a7964ae23f.jpg
│ ├── 0aded2a84831be7b912ef85f6c1eb6e2.jpg
│ └── 0adf204564879c270bafba334ca99e3c.jpg
├── annotation_edge
│ └── (same as images)
├── annotation_depth
│ └── (same as images)
└── ...
The code loads the data based on metadata.jsonl. Each line of metadata.jsonl is a dictionary containing paths to an image and its annotations (labels), the image height and width, and captions from different models. For example:
{
"image": "image/0adbfa3cab59b674b83f24a7964ae23f.jpg",
"info": {"height": 1280, "width": 1024},
"caption": {"Qwen2-VL-7b-Instruct": "xxxxxxxx", "BLIP2-OPT-2.7b": "yyy"},
"annotation_edge": "annotation_edge/0adbfa3cab59b674b83f24a7964ae23f.jpg",
"annotation_depth": "annotation_depth/0adbfa3cab59b674b83f24a7964ae23f.jpg",
# ...
}
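For illustration, here is a minimal sketch of how one might iterate over the dataset given this format (this is not the project's data loader; the dataset root and the caption key are assumptions based on the example above):

```python
# Minimal sketch: iterate over a dataset in the Joint-1.6M format.
import json
from pathlib import Path
from PIL import Image

root = Path("assets/example_data")  # or the Joint-1.6M root

with open(root / "metadata.jsonl") as f:
    for line in f:
        record = json.loads(line)
        image = Image.open(root / record["image"])
        caption = record["caption"]["Qwen2-VL-7b-Instruct"]
        # every key starting with "annotation_" points to a label map
        labels = {k: Image.open(root / v)
                  for k, v in record.items() if k.startswith("annotation_")}
```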
Jodi is built on top of Sana. You can either fine-tune Jodi or train from the Sana weights directly. Download the corresponding checkpoint first:
# download Jodi
huggingface-cli download VIPL-GENUN/Jodi
# download Sana
huggingface-cli download Efficient-Large-Model/Sana_1600M_1024px_BF16
Fine-tune Jodi on the example dataset:
bash scripts/train_from_jodi.sh ./configs/train_example_data.yaml
Train from Sana on the Joint-1.6M dataset:
bash scripts/train_from_sana.sh ./configs/train_joint1.6m.yaml
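To train on your own data, you can write a metadata.jsonl in the same format. Below is a hedged sketch that generates one from a directory laid out like assets/example_data; the directory name my_dataset and the caption key my_captioner are placeholders, not names used by this repo:

```python
# Sketch: build a metadata.jsonl for custom data in the Joint-1.6M format.
import json
from pathlib import Path
from PIL import Image

root = Path("my_dataset")  # placeholder: your dataset root
with open(root / "metadata.jsonl", "w") as f:
    for img_path in sorted((root / "image").glob("*.jpg")):
        w, h = Image.open(img_path).size  # PIL returns (width, height)
        record = {
            "image": f"image/{img_path.name}",
            "info": {"height": h, "width": w},
            "caption": {"my_captioner": "a caption for this image"},
            "annotation_edge": f"annotation_edge/{img_path.name}",
            "annotation_depth": f"annotation_depth/{img_path.name}",
        }
        f.write(json.dumps(record) + "\n")
```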
This project is built upon Sana. Thanks for their great work!
If you find this project helpful, please consider citing:
@article{xu2025jodi,
title={Jodi: Unification of Visual Generation and Understanding via Joint Modeling},
author={Xu, Yifeng and He, Zhenliang and Kan, Meina and Shan, Shiguang and Chen, Xilin},
journal={arXiv preprint arXiv:2505.19084},
year={2025}
}