---
license: apache-2.0
language:
- zh
- en
tasks:
- text-to-image-synthesis
base_model:
- BAAI/Emu3.5-Image
frameworks: PyTorch
base_model_relation: quantized
pipeline_tag: any-to-any
---

===================================================================================

This model is an NF4-quantized version of https://huggingface.co/BAAI/Emu3.5-Image. It can be loaded directly with the official inference code; the only additional dependency is `bitsandbytes`.

With the model fully loaded on the GPU it occupies about 24 GB of VRAM, and image generation peaks at about 32 GB. (Based on my own tests, installing the prebuilt `flash_attn==2.7.4` wheel also works.)

Example Generated Image Prompt: "Live shot, close-up, full-body photo, a snow leopard standing on a rock, the body is standing sideways, standing slightly upward on the rock, the tail is slightly cocked, the head is twisted to face the camera, the eyes are looking directly at the camera, the expression is majestic, the background is slightly blurred in the distance, gray rocks and mountains."
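Since this repo ships NF4-quantized weights, loading should only require `bitsandbytes` to be installed, with the quantization settings read from the checkpoint itself. Below is a minimal, hedged sketch assuming the model loads through transformers' `AutoModelForCausalLM` with `trust_remote_code=True`, as the Emu3 series does; the paths are placeholders, and the official `inference.py` remains the reference implementation. The on-the-fly NF4 recipe for the original `BAAI/Emu3.5-Image` weights is also shown for comparison.

```python
# Hedged sketch: loading this NF4 checkpoint with bitsandbytes installed.
# Assumption: the model loads via transformers' AutoModelForCausalLM with
# trust_remote_code=True (as the Emu3 series does); paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_path = "/path/to/this/Emu3.5-Image-NF4"  # local clone of this repo (placeholder)

# (a) Pre-quantized weights: the NF4 quantization config is stored with the
#     checkpoint, so a plain from_pretrained should pick it up automatically.
model = AutoModelForCausalLM.from_pretrained(
    nf4_path,
    device_map="auto",       # place layers on the available GPU(s)
    trust_remote_code=True,  # Emu3.5 ships custom modeling code
)

# (b) Alternatively, quantize the original weights to NF4 on the fly:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights via bitsandbytes
    bnb_4bit_quant_type="nf4",              # NF4 quantization, as used by this checkpoint
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for dequantized matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu3.5-Image",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```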

Emu3.5-Image is the latest omni-modal model open-sourced by BAAI (Beijing Academy of Artificial Intelligence), with quality comparable to Google's Nano Banana. The introduction below is quoted from the official model page. This model is provided for community research and learning; please comply with the official license and copyright terms.

===================================================================================

# Emu3.5: Native Multimodal Models are World Learners

Emu3.5 Team, BAAI [Project Page](https://emu.world/) | [🤗HF Models](https://huggingface.co/collections/BAAI/emu35) | [Paper](https://arxiv.org/pdf/2510.26583)
(Architecture overview figures from the official README omitted; see the project page.)
| 🔹 | **Core Concept** | **Description** |
| :-: | :--- | :--- |
| 🧠 | **Unified World Modeling** | Predicts the **next state jointly across vision and language**, enabling coherent **world modeling** and **generation**. |
| 🧩 | **End-to-End Pretraining** | Trained with a **unified next-token prediction** objective over **interleaved vision–language sequences**. |
| 📚 | **10T+ Multimodal Tokens** | Pre-trained on **over 10 trillion interleaved tokens** from **video frames** and **transcripts**, capturing **spatiotemporal structure**. |
| 🔄 | **Native Multimodal I/O** | Processes and generates **interleaved visual–text sequences** without **modality adapters** or **task-specific heads**. |
| 🎯 | **RL Post-Training** | Large-scale **reinforcement learning** enhances **reasoning**, **compositionality**, and **generation quality**. |
| ⚡ | **Discrete Diffusion Adaptation (DiDA)** | Converts **sequential decoding → bidirectional parallel prediction**, achieving **≈20× faster inference without performance loss**. |
| 🖼️ | **Versatile Generation** | Excels in **long-horizon vision–language generation**, **any-to-image (X2I)** synthesis, and **text-rich image creation**. |
| 🌐 | **Generalizable World Modeling** | Enables **spatiotemporally consistent world exploration** and **open-world embodied manipulation** across diverse scenarios. |
| 🏆 | **Performance Benchmark** | Matches **Gemini 2.5 Flash Image (Nano Banana)** on **image generation/editing**, and **outperforms it** on **interleaved generation tasks**. |

## Table of Contents

1. [Model & Weights](#1-model--weights)
2. [Quick Start](#2-quick-start)
3. [Schedule](#3-schedule)
4. [Citation](#4-citation)

## 1. Model & Weights

| Model name | HF Weight |
| --- | --- |
| Emu3.5 | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5/tree/main) |
| Emu3.5-Image | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-Image/tree/main) |
| Emu3.5-VisionTokenizer | [🤗 HF link](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer/tree/main) |

## 2. Quick Start

### Environment Setup

```bash
git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
```

### Configuration

Edit `configs/config.py` to set:

- Paths: `model_path`, `vq_path`
- Task template: `task_type` in `{t2i, x2i, howto, story, explore, vla}`; `use_image` controls `<|IMAGE|>` usage (set it to true when reference images are provided)
- Sampling: `sampling_params` (classifier_free_guidance, temperature, top_k/top_p, etc.)

An illustrative sketch of these fields appears after the Schedule section below.

### Run Inference

```bash
python inference.py --cfg configs/config.py
```

Protobuf outputs are written to `outputs//proto/`. For better throughput, we recommend ≥2 GPUs.

### Visualize Protobuf Outputs

To visualize generated protobuf files:

```bash
python src/utils/vis_proto.py --input <proto_file> --output <output_dir>
```

## 3. Schedule

- [x] Inference Code
- [ ] Advanced Image Decoder
- [ ] Discrete Diffusion Adaptation (DiDA)
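For reference, the `configs/config.py` fields listed in the Configuration step above might look roughly like the following. This is only an illustrative sketch built from the field names mentioned in this README; the actual structure, defaults, and any additional fields in the official repo may differ, and the concrete values are placeholders.

```python
# Illustrative sketch of the configs/config.py fields described under "Configuration".
# Only the field names (model_path, vq_path, task_type, use_image, sampling_params,
# classifier_free_guidance, temperature, top_k, top_p) come from this README;
# the structure and values below are placeholders, not the official defaults.

model_path = "/path/to/Emu3.5-Image-NF4"       # quantized model checkpoint
vq_path = "/path/to/Emu3.5-VisionTokenizer"    # vision tokenizer checkpoint

task_type = "t2i"    # one of: t2i, x2i, howto, story, explore, vla
use_image = False    # set True when reference images are provided (<|IMAGE|> tokens)

sampling_params = dict(
    classifier_free_guidance=3.0,  # placeholder CFG scale
    temperature=1.0,
    top_k=2048,
    top_p=0.9,
)
```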
## 4. Citation

```bibtex
@misc{cui2025emu35nativemultimodalmodels,
  title={Emu3.5: Native Multimodal Models are World Learners},
  author={Yufeng Cui and Honghao Chen and Haoge Deng and Xu Huang and Xinghang Li and Jirong Liu and Yang Liu and Zhuoyan Luo and Jinsheng Wang and Wenxuan Wang and Yueze Wang and Chengyuan Wang and Fan Zhang and Yingli Zhao and Ting Pan and Xianduo Li and Zecheng Hao and Wenxuan Ma and Zhuo Chen and Yulong Ao and Tiejun Huang and Zhongyuan Wang and Xinlong Wang},
  year={2025},
  eprint={2510.26583},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.26583},
}
```