
Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie1  Zhenheng Yang2  Mike Zheng Shou1

1 Show Lab, National University of Singapore  2 Bytedance 


What is new about Show-o2?

We perform unified learning of multimodal understanding and generation on the text token and 3D causal VAE space, which scales across the text, image, and video modalities. A dual path of spatial(-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation. With task-specific heads, autoregressive modeling and flow matching are employed for the overall unified learning of multimodal understanding, image/video generation, and mixed-modality generation.
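For intuition, here is a minimal PyTorch sketch of the dual-path idea; the module names and dimensions are hypothetical (the actual architecture is defined in the paper and code). One path projects the 3D causal VAE latents for understanding, the other for generation, the two are fused into the shared token sequence, and two heads read out the backbone states:

import torch
import torch.nn as nn

class DualPathFusion(nn.Module):
    # Hypothetical sketch of spatial(-temporal) dual-path fusion.
    def __init__(self, latent_dim=16, hidden_dim=2048):
        super().__init__()
        self.und_proj = nn.Linear(latent_dim, hidden_dim)   # understanding path
        self.gen_proj = nn.Linear(latent_dim, hidden_dim)   # generation path
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)   # fuse the two paths

    def forward(self, vae_latents):
        # vae_latents: (batch, num_tokens, latent_dim) from the 3D causal VAE;
        # video latents flatten frames into the token dimension.
        u = self.und_proj(vae_latents)
        g = self.gen_proj(vae_latents)
        return self.fuse(torch.cat([u, g], dim=-1))

# Two heads over the shared backbone states (sizes are placeholders):
lm_head = nn.Linear(2048, 32000)  # autoregressive next-token logits (understanding)
flow_head = nn.Linear(2048, 16)   # flow-matching velocity in VAE latent space (generation)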

Pre-trained Model Weights

The Show-o2 checkpoints can be found on Hugging Face.

Getting Started

First, set up the environment:

bash build_env.sh

Log in to your wandb account on your machine or server:

wandb login <your wandb key>

Download the Wan2.1 3D causal VAE model weights here and put them in the current directory.
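If you fetch the weights from Hugging Face, something like the following works; note that the repo id and file name below are assumptions, so verify them against the Wan2.1 release page:

from huggingface_hub import hf_hub_download

vae_path = hf_hub_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",  # assumed repo id -- verify on the release page
    filename="Wan2.1_VAE.pth",         # assumed file name -- verify on the release page
    local_dir=".",                     # place the weights in the current directory
)
print(vae_path)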

Demo for Multimodal Understanding. The results can be found on wandb.

python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
                         mmu_image_path=./docs/mmu/pexels-jane-pham-727419-1571673.jpg question='Describe the image in detail.'

python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
                         mmu_image_path=./docs/mmu/pexels-fotios-photos-2923436.jpg question='请告诉我图片中写着什么?'

python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
                         mmu_image_path=./docs/mmu/pexels-taryn-elliott-4144459.jpg question='How many avocados (including the halved) are in this image? Tell me how to make an avocado milkshake in detail.'

Demo for Text-to-Image Generation. The results can be found on wandb.

python3 inference_t2i.py config=configs/showo2_1.5b_demo_1024x1024.yaml \
                         batch_size=4 guidance_scale=7.5 num_inference_steps=50

python3 inference_t2i.py config=configs/showo2_1.5b_demo_512x512.yaml \
                         batch_size=4 guidance_scale=7.5 num_inference_steps=50

python3 inference_t2i.py config=configs/showo2_1.5b_demo_432x432.yaml \
                         batch_size=4 guidance_scale=7.5 num_inference_steps=50

python3 inference_t2i.py config=configs/showo2_7b_demo_432x432.yaml \
                         batch_size=4 guidance_scale=7.5 num_inference_steps=50

Citation

To cite the paper and model, please use the following BibTeX entry:

@article{xie2025showo2,
  title={Show-o2: Improved Native Unified Multimodal Models},
  author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint},
  year={2025}
}

Acknowledgments

This work is heavily based on Show-o.
