To run the code, first install the requirements. The code runs on Python 3.9 with PyTorch 2, tensorflow==2.15.0, and CUDA 12.1.
Clone the repository as follows:
git clone https://github.com/RoboDita/Dita
Then install the base environment:
pip install -r requirements_base.txt
If you only evaluate on CALVIN, you can use requirements_calvin.txt instead. To avoid dependency conflicts, we suggest installing tensorflow-probability==0.22.0 separately.
You may also need to install pytorch3d (not necessary for pretraining). We build pytorch3d from git+https://github.com/facebookresearch/pytorch3d.git@89653419d0973396f3eff1a381ba09a07fffc2ed#egg=pytorch3d.
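For reference, these two extra installs could look like the following (a minimal sketch; adjust to your CUDA/PyTorch setup):

```bash
# Install tensorflow-probability separately to avoid dependency conflicts
pip install tensorflow-probability==0.22.0

# Build pytorch3d from the pinned commit (not required for pretraining)
pip install "git+https://github.com/facebookresearch/pytorch3d.git@89653419d0973396f3eff1a381ba09a07fffc2ed#egg=pytorch3d"
```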
We provide the following models, which can be used for finetuning.
Model | Description | Checkpoint Path |
---|---|---|
Dita | Diffusion Transformer Policy | Google Drive |
Dita | Diffusion Transformer Policy (Pretrained on Droid) | Google Drive |
Dita | Diffusion Transformer Policy (w/o image augmentation) | Google Drive |
Diffusion MLP Head | Transformer with Diffusion Head Policy (w/o image augmentation) | Google Drive |
Before running the code, update the S3 credentials "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", and "S3_ENDPOINT", and update 'data_path = "s3://openx"' in scripts/train_diffusion_oxe.py. We train the network with 32 GPUs. In our experiments we do not find a significant difference between dataset.traj_length=32 and dataset.traj_length=16 for OXE pretraining, so we suggest
dataset.traj_length=16 num_pred_action=15
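Assuming the S3 credentials are read from environment variables, a minimal sketch of setting them before launching training (placeholder values, replace with your own):

```bash
# Placeholder values -- replace with your own S3 credentials and endpoint
export AWS_ACCESS_KEY_ID=<your_access_key_id>
export AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
export S3_ENDPOINT=<your_s3_endpoint>
# Also point data_path in scripts/train_diffusion_oxe.py at your OXE bucket, e.g. data_path = "s3://openx"
```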
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=0 --master_port=$MASTER_PORT --master_addr=$MASTER_ADDR scripts/train_diffusion_oxe.py task_name=openx_full_train_o2_p32 dataset.traj_length=16 num_pred_action=15 scheduler_type=1 shuffle_buffer_size=256000 dataname=oxe_magic_soup_plus task_name=oxe_full_train_o2_p15_wotimestep_oxe_noclamp_filter batch_size=256
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=1 --master_port=$MASTER_PORT --master_addr=$MASTER_ADDR scripts/train_diffusion_oxe.py task_name=openx_full_train_o2_p32 dataset.traj_length=16 num_pred_action=15 scheduler_type=1 shuffle_buffer_size=256000 dataname=oxe_magic_soup_plus task_name=oxe_full_train_o2_p15_wotimestep_oxe_noclamp_filter batch_size=256
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=2 --master_port=$MASTER_PORT --master_addr=$MASTER_ADDR scripts/train_diffusion_oxe.py task_name=openx_full_train_o2_p32 dataset.traj_length=16 num_pred_action=15 scheduler_type=1 shuffle_buffer_size=256000 dataname=oxe_magic_soup_plus task_name=oxe_full_train_o2_p15_wotimestep_oxe_noclamp_filter batch_size=256
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=3 --master_port=$MASTER_PORT --master_addr=$MASTER_ADDR scripts/train_diffusion_oxe.py task_name=openx_full_train_o2_p32 dataset.traj_length=16 num_pred_action=15 scheduler_type=1 shuffle_buffer_size=256000 dataname=oxe_magic_soup_plus task_name=oxe_full_train_o2_p15_wotimestep_oxe_noclamp_filter batch_size=256
We observe that image augmentation is beneficial for SimplerEnv in our experiments. To use image augmentation, add `+image_aug=1` to the training command.
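For example, appended to one of the pretraining launch commands above (a sketch; the remaining arguments are elided with `...`):

```bash
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=0 ... scripts/train_diffusion_oxe.py ... +image_aug=1
```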
Here we provide an example of finetuning with LoRA, i.e., the 10-shot finetuning command for the real Franka arm.
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 scripts/finetune_realdata.py +pretrained_path=dit_policy_checkpoint.pth dataset.traj_per_episode=16 dataset.traj_length=1 task_name=new_test_nodiffhead_few10_250124 num_pred_action=1 dataname=lab_907_1 batch_size=32 dataset.train_data_list=<your pkl data list file containing the collected pkl file names> use_lora=True scheduler_type=0 dataset.num_given_observation=1 max_iters=10000
scheduler_type=0 indicates we use 100 DDPM training steps.
We would like to highlight that finetuning without LoRA achieves clearly better performance (more robustness to variations, longer-horizon tasks).
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 scripts/finetune_realdata.py +pretrained_path=dit_policy_checkpoint.pth dataset.traj_per_episode=16 dataset.traj_length=1 task_name=new_test_nodiffhead_few10_250124 num_pred_action=1 dataname=lab_907_1 batch_size=32 dataset.train_data_list=<your pkl data list file containing the collected pkl file names> use_lora=False scheduler_type=0 dataset.num_given_observation=1 max_iters=10000
First, follow the instruction-calvin to install the CALVIN environment.
We train the network with 4 GPUs.
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 scripts/train_diffusion_sim.py --config-name config_diffusion_calvin batch_size=32 dataset.traj_length=11 num_pred_action=10 task_name=calvin_exp dataset.num_given_observation=2 use_close_loop_eval=True close_loop_eval.test_episodes_num=32 taskname=task_ABC_D dataname=calvin_mc close_loop_eval.eval_iters=10000 close_loop_eval.test_episodes_num=250 scheduler_type=0 wrap_grmg_data=2 +pretrained_path=dit_policy_checkpoint.pth +use_adjust_scheduler=true lr=0.0001 epoch=15 +min_lr_scale=0.01 scheduler.warmup_epochs=1 num_inference_steps=10
First, please follow OpenVLA to set up the LIBERO benchmark and obtain the modified version of the dataset.
We train and evaluate the model with 8 NVIDIA GPUs.
Here is an example of the training script.
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 scripts/train_diffusion_oxe.py task_name=finetuning_LIBERO dataname=libero_spatial_no_noops dataset.traj_length=11 num_pred_action=10 scheduler_type=0 shuffle_buffer_size=128000 batch_size=64 use_close_loop_eval=True +trajectory_dim=7 +pretrained_path=dit_policy_checkpoint.pth +use_adjust_scheduler=true lr=0.0001 +min_lr_scale=0.01 +image_aug=true num_inference_steps=10
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 scripts/train_diffusion_oxe.py task_name=finetuning_LIBERO dataname=libero_10_no_noops dataset.traj_length=11 num_pred_action=10 scheduler_type=0 shuffle_buffer_size=128000 batch_size=64 use_close_loop_eval=True +trajectory_dim=7 +pretrained_path=dit_policy_checkpoint.pth +use_adjust_scheduler=true lr=0.0005 +min_lr_scale=0.01 +image_aug=true num_inference_steps=10
Method | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Average |
---|---|---|---|---|---|
Diffusion Policy from scratch | 78.3% | 92.5% | 68.3% | 50.5% | 72.4% |
Octo fine-tuned | 78.9% | 85.7% | 84.6% | 51.1% | 75.1% |
OpenVLA fine-tuned | 84.7% | 88.4% | 79.2% | 53.7% | 76.5% |
Ours fine-tuned | 84.2% | 96.3% | 85.4% | 63.8% | 82.4% |
Method | Input | 1 | 2 | 3 | 4 | 5 | Avg.Len. |
---|---|---|---|---|---|---|---|
RoboFlamingo | S-RGB, G-RGB | 82.4% | 61.9% | 46.6% | 33.1% | 23.5% | 2.47 |
SuSIE | S-RGB | 87.0% | 69.0% | 49.0% | 38.0% | 26.0% | 2.69 |
GR-1 | S-RGB, G-RGB, P | 85.4% | 71.2% | 59.6% | 49.7% | 40.1% | 3.06 |
3D Diffuser | S-RGBD, G-RGBD, Proprio, Cam | 92.2% | 78.7% | 63.9% | 51.2% | 41.2% | 3.27 |
Ours w/o pretraining | S-RGB | 89.5% | 63.3% | 39.8% | 27.3% | 18.5% | 2.38 |
Ours | S-RGB | 94.5% | 82.5% | 72.8% | 61.3% | 50.0% | 3.61 |
Here is the checkpoint for CALVIN (ABC->D): calvin_abc-d.pth.
You can evaluate the model as follows:
PYTHONPATH=. python3 scripts/train_diffusion_sim.py --config-name config_diffusion_calvin batch_size=32 dataset.traj_length=11 num_pred_action=10 task_name=calvin_exp dataset.num_given_observation=2 use_close_loop_eval=True close_loop_eval.test_episodes_num=32 taskname=task_ABC_D dataname=calvin_mc close_loop_eval.eval_iters=10000 close_loop_eval.test_episodes_num=250 scheduler_type=0 +wrap_grmg_data=2 +pretrained_path=calvin_abc-d.pth +use_adjust_scheduler=true lr=0.0001 epoch=15 +min_lr_scale=0.01 scheduler.warmup_epochs=1 num_inference_steps=10 +eval_only=1
The evaluation environment requirements are provided in calvin_python3.10_requirements.txt. You can refer to this file to configure your environment for CALVIN.
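A minimal sketch for setting up the evaluation environment from that file (assuming a fresh Python 3.10 virtual environment; the environment name is arbitrary):

```bash
# Create and activate a Python 3.10 environment for CALVIN evaluation
python3.10 -m venv calvin_eval
source calvin_eval/bin/activate
pip install -r calvin_python3.10_requirements.txt
```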
Simulation Benchmark Evaluations
This evaluation is based on SimplerEnv.
Task | Dita (ours) | RT-1-X | Octo-Base | OpenVLA |
---|---|---|---|---|
coke_can/matching | 0.837 | 0.567 | 0.17 | 0.163 |
coke_can/variant | 0.855 | 0.490 | 0.006 | 0.545 |
move_near/matching | 0.760 | 0.317 | 0.042 | 0.462 |
move_near/variant | 0.730 | 0.323 | 0.031 | 0.477 |
drawer/matching | 0.463 | 0.597 | 0.227 | 0.356 |
drawer/variant | 0.375 | 0.294 | 0.011 | 0.177 |
Please refer to the project page.
The OXE dataloader code and part of the LIBERO setup code are based on OpenVLA. The CALVIN dataloader code is based on GR-MG. The architecture is based on Transformers. If you have any questions, feel free to contact Zhi Hou (zhou9878 at uni dot sydney dot edu dot au) or Tianyi Zhang (tianyizhang0213 at zju dot edu dot cn).
If you find our code or models useful in your work, please consider citing our paper:
@article{hou2025dita,
title={Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy},
author={Hou, Zhi and Zhang, Tianyi and Xiong, Yuwen and Duan, Haonan and Pu, Hengjun and Tong, Ronglei and Zhao, Chengyang and Zhu, Xizhou and Qiao, Yu and Dai, Jifeng and Chen, Yuntao},
journal={arXiv preprint arXiv:2503.19757},
year={2025}
}
An earlier preprint version is available on arXiv.