Jodi: Unification of Visual Generation and Understanding via Joint Modeling
Yifeng Xu<sup>1,2</sup>, Zhenliang He<sup>1</sup>, Meina Kan<sup>1,2</sup>, Shiguang Shan<sup>1,2</sup>, Xilin Chen<sup>1,2</sup>
<sup>1</sup>State Key Lab of AI Safety, Institute of Computing Technology, CAS, China
<sup>2</sup>University of Chinese Academy of Sciences, China


We introduce Jodi, a diffusion framework that unifies visual generation and understanding by jointly modeling the image domain and multiple label domains. Jodi is built upon a linear diffusion transformer with a role switch mechanism, enabling joint generation, controllable generation, and image perception in a unified diffusion model.

🛠️ Installation

The code is tested with Python 3.10.0, PyTorch 2.4.0, and CUDA 12.1.

Clone this repo:

git clone https://github.com/VIPL-GENUN/Jodi.git
cd Jodi

Create and activate a new conda environment:

conda create -n jodi python=3.10.0 -y
conda activate jodi

Install dependencies:

pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu121
pip install xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
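
To confirm the pinned builds installed cleanly, a quick check like the following (our suggestion, not part of the repo) can be run in the new environment:

import torch
import torchvision
import xformers

# expect versions matching the pins above, e.g. 2.4.0+cu121 / 0.19.0 / 0.0.27.post2
print(torch.__version__, torchvision.__version__, xformers.__version__)
print("CUDA available:", torch.cuda.is_available())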

🤖️ Download Models

We provide our model on Hugging Face. It is downloaded automatically when you launch the Gradio demo, or you can download it manually with the following command:

huggingface-cli download VIPL-GENUN/Jodi
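
If you prefer to do this from Python, the equivalent call with the huggingface_hub library (which huggingface-cli ships with) is:

from huggingface_hub import snapshot_download

# downloads the repository into the local Hugging Face cache and returns its path
local_dir = snapshot_download("VIPL-GENUN/Jodi")
print(local_dir)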

🚀 Gradio Demo

python app/jodi_gradio.py --model_path hf://VIPL-GENUN/Jodi/Jodi.pth
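
Once the model is loaded, Gradio serves the demo locally (by default at http://localhost:7860, unless the script configures a different address or port).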

🔥 Training

Step 1: Data Preparation

We provide the Joint-1.6M dataset on Hugging Face. To help you get started quickly, we also provide a small example dataset in assets/example_data with the same file structure as Joint-1.6M.

assets/example_data
├── metadata.jsonl
├── image
│   ├── 0adbfa3cab59b674b83f24a7964ae23f.jpg
│   ├── 0aded2a84831be7b912ef85f6c1eb6e2.jpg
│   └── 0adf204564879c270bafba334ca99e3c.jpg
├── annotation_edge
│   └── (same as images)
├── annotation_depth
│   └── (same as images)
└── ...

The code loads data based on metadata.jsonl. Each line of metadata.jsonl is a JSON object containing the paths to an image and its annotations (labels), the image height and width, and captions produced by different captioning models. For example:

{
  "image": "image/0adbfa3cab59b674b83f24a7964ae23f.jpg",
  "info": {"height": 1280, "width": 1024},
  "caption": {"Qwen2-VL-7b-Instruct": "xxxxxxxx", "BLIP2-OPT-2.7b": "yyy"},
  "annotation_edge": "annotation_edge/0adbfa3cab59b674b83f24a7964ae23f.jpg",
  "annotation_depth": "annotation_depth/0adbfa3cab59b674b83f24a7964ae23f.jpg",
  # ...
}
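
For illustration, here is a minimal sketch of how such a record could be parsed, using only the standard library (the repo's actual data loader may differ):

import json
from pathlib import Path

root = Path("assets/example_data")
with open(root / "metadata.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        image_path = root / record["image"]
        height = record["info"]["height"]
        width = record["info"]["width"]
        # every key starting with "annotation_" points to one label domain
        labels = {key: root / rel for key, rel in record.items() if key.startswith("annotation_")}
        captions = record.get("caption", {})  # one caption per captioning model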

Step 2: Download Models

Jodi is built on top of Sana. You can either fine-tune from the Jodi checkpoint or train the model from the Sana checkpoint.

# download Jodi
huggingface-cli download VIPL-GENUN/Jodi
# download Sana
huggingface-cli download Efficient-Large-Model/Sana_1600M_1024px_BF16
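
Only the checkpoint you plan to start from is needed; the other download can be skipped.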

Step 3: Start Training

Fine-tune Jodi on the example dataset:

bash scripts/train_from_jodi.sh ./configs/train_example_data.yaml

Train from Sana on the Joint-1.6M dataset:

bash scripts/train_from_sana.sh ./configs/train_joint1.6m.yaml

🪧 Acknowledgement

This project is built upon Sana. Thanks for their great work!

✏️ Citation

If you find this project helpful, please consider citing:

@article{xu2025jodi,
  title={Jodi: Unification of Visual Generation and Understanding via Joint Modeling},
  author={Xu, Yifeng and He, Zhenliang and Kan, Meina and Shan, Shiguang and Chen, Xilin},
  journal={arXiv preprint arXiv:2505.19084},
  year={2025}
}
