Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie1 Zhenheng Yang2 Mike Zheng Shou1
1 Show Lab, National University of Singapore 2 Bytedance
This model is part of the Show-o2 family of improved native unified multimodal models.
Paper: Show-o2: Improved Native Unified Multimodal Models
Code: https://github.com/showlab/Show-o/tree/main/show-o2
Abstract
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o/tree/main/show-o2.
What is new about Show-o2?
We perform unified learning of multimodal understanding and generation on the text token and 3D causal VAE space, which is scalable across text, image, and video modalities. A dual-path of spatial (-temporal) fusion is proposed to accommodate the distinct feature dependencies of multimodal understanding and generation. We employ dedicated heads with autoregressive modeling and flow matching for the overall unified learning of multimodal understanding, image/video generation, and mixed-modality generation.
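As a rough mental model of this design (a minimal conceptual sketch, not the actual Show-o2 implementation; every module name below is hypothetical), the following PyTorch snippet fuses two projections of the 3D causal VAE latents, runs them through a shared backbone together with text embeddings, and attaches a language head for next-token prediction and a flow head for velocity prediction.

```python
import torch
import torch.nn as nn

# Illustrative sketch only; module names are hypothetical and do not mirror the Show-o2 code.
class UnifiedSketch(nn.Module):
    def __init__(self, dim=256, vocab_size=1000):
        super().__init__()
        # Dual-path fusion over 3D causal VAE latents: one projection oriented toward
        # understanding (semantic) features, one toward generation (low-level) features.
        self.und_path = nn.Linear(dim, dim)
        self.gen_path = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)
        # Stand-in for the shared language-model backbone.
        self.backbone = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        # Two heads: autoregressive language head and flow-matching (velocity) head.
        self.lm_head = nn.Linear(dim, vocab_size)
        self.flow_head = nn.Linear(dim, dim)

    def forward(self, text_emb, visual_latents):
        # Fuse the two visual paths, then process text and visual tokens jointly.
        v = self.fuse(torch.cat([self.und_path(visual_latents),
                                 self.gen_path(visual_latents)], dim=-1))
        h = self.backbone(torch.cat([text_emb, v], dim=1))
        text_len = text_emb.shape[1]
        logits = self.lm_head(h[:, :text_len])      # autoregressive text token prediction
        velocity = self.flow_head(h[:, text_len:])  # flow matching for image/video latents
        return logits, velocity

# Toy usage with random tensors.
model = UnifiedSketch()
logits, velocity = model(torch.randn(1, 8, 256), torch.randn(1, 16, 256))
print(logits.shape, velocity.shape)  # torch.Size([1, 8, 1000]) torch.Size([1, 16, 256])
```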
Pre-trained Model Weights
The Show-o2 checkpoints can be found on Hugging Face:
Getting Started
First, set up the environment:
bash build_env.sh
Log in to your wandb account on your machine or server.
wandb login <your wandb key>
Download the Wan2.1 3D causal VAE model weights here and put them in the current directory.
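If you prefer to fetch the weights programmatically, a sketch using huggingface_hub is shown below; the repo id and filename are assumptions, so check the official Wan2.1 release for the exact location.

```python
from huggingface_hub import hf_hub_download

# Assumed repo id and filename -- verify against the official Wan2.1 release.
vae_path = hf_hub_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",  # assumption
    filename="Wan2.1_VAE.pth",         # assumption
    local_dir=".",                     # place it in the current directory, as required above
)
print(vae_path)
```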
Demo for Multimodal Understanding. The results can be found on wandb.
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-jane-pham-727419-1571673.jpg question='Describe the image in detail.'
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-fotios-photos-2923436.jpg question='请告诉我图片中写着什么?'
python3 inference_mmu.py config=configs/showo2_7b_demo_432x432.yaml \
mmu_image_path=./docs/mmu/pexels-taryn-elliott-4144459.jpg question='How many avocados (including the halved) are in this image? Tell me how to make an avocado milkshake in detail.'
Demo for Text-to-Image Generation. The results can be found on wandb.
python3 inference_t2i.py config=configs/showo2_1.5b_demo_1024x1024.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
python3 inference_t2i.py config=configs/showo2_1.5b_demo_512x512.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
python3 inference_t2i.py config=configs/showo2_1.5b_demo_432x432.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
python3 inference_t2i.py config=configs/showo2_7b_demo_432x432.yaml \
batch_size=4 guidance_scale=7.5 num_inference_steps=50;
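The commands above expose guidance_scale and num_inference_steps. As a rough illustration of how these parameters typically enter a flow-matching sampler with classifier-free guidance (a generic sketch, not the Show-o2 sampler; velocity_model is a dummy stand-in for the flow head), consider the following Euler-integration loop.

```python
import torch

def velocity_model(x, t, cond):
    # Dummy velocity field standing in for the (conditional) flow head.
    return -x if cond is not None else -0.5 * x

def sample(shape, guidance_scale=7.5, num_inference_steps=50, cond="prompt"):
    x = torch.randn(shape)                                    # start from pure noise at t=1
    ts = torch.linspace(1.0, 0.0, num_inference_steps + 1)    # integrate toward t=0 (data)
    for i in range(num_inference_steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        v_cond = velocity_model(x, t, cond)                   # conditional velocity
        v_uncond = velocity_model(x, t, None)                 # unconditional velocity
        v = v_uncond + guidance_scale * (v_cond - v_uncond)   # classifier-free guidance
        x = x + dt * v                                        # Euler step
    return x

latents = sample((1, 16, 256))
print(latents.shape)
```

Larger guidance_scale values push samples more strongly toward the prompt, while more inference steps trade speed for integration accuracy.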
Citation
To cite the paper and model, please use the BibTeX entry below:
@article{xie2025showo2,
title={Show-o2: Improved Native Unified Multimodal Models},
author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
journal={arXiv preprint},
year={2025}
}
Acknowledgments
This work is heavily based on Show-o.