JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers (ICCV 2025)

Overview

The preprint version is available on arXiv.
📑 Paper | 🤗 Project Page | 🤗 Code

JointDiT is a multimodal diffusion transformer that jointly models RGB and Depth.
It supports the following tasks:

  • Text to joint RGB-Depth generation
  • Depth estimation from RGB
  • Depth-conditioned image generation


How to Use

JointDiT is built on top of black-forest-labs/FLUX.1-dev,
but requires additional modules and a custom pipeline implementation.

👉 Please visit the GitHub repository
for installation, training, and inference instructions.


Citation

If you find this work useful, please cite:

@article{byung2025jointdit,
  title={JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers},
  author={Kwon, Byung-Ki and Dai, Qi and Lee, Hyoseok and Luo, Chong and Oh, Tae-Hyun},
  journal={arXiv preprint arXiv:2505.00482},
  year={2025}
}