---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
---
# IVT-LR
## Overview
This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).
Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables MLLMs to carry out these multimodal latent reasoning steps.
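To make the idea concrete, below is a minimal illustrative sketch of an interleaved latent reasoning loop. This is **not** the actual IVT-LR implementation: the function, its arguments, and the split between latent text and latent vision are hypothetical placeholders; see the GitHub repository linked under Usage for the real code.

```python
import torch

def latent_reasoning_sketch(backbone, text_hidden, vision_hidden, num_steps=3):
    """Illustrative sketch only: each reasoning step stays in latent space.

    `backbone` is assumed to be a transformer that accepts `inputs_embeds`
    and returns an object with `last_hidden_state` (as Hugging Face base
    models do). All names here are assumptions for illustration.
    """
    # Start from the encoded question: text embeddings followed by
    # vision embeddings along the sequence dimension.
    latent_sequence = torch.cat([text_hidden, vision_hidden], dim=1)
    for _ in range(num_steps):
        # One latent step: run the backbone and keep hidden states
        # instead of decoding them into explicit tokens.
        hidden = backbone(inputs_embeds=latent_sequence).last_hidden_state
        # Split the step into its two implicit parts -- latent text and
        # latent vision. The split point here is purely illustrative.
        latent_text = hidden[:, : text_hidden.size(1)]
        latent_vision = hidden[:, text_hidden.size(1) :]
        # Append both latent parts so the next step conditions on them,
        # interleaving text and vision in latent space.
        latent_sequence = torch.cat(
            [latent_sequence, latent_text, latent_vision], dim=1
        )
    return latent_sequence
```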
---
## Usage
This repository provides pretrained models for **Qwen2-VL on M3CoT** and **Chameleon on ScienceQA**.
To see detailed usage, including inference code and scripts for training, please refer to the [GitHub repository](https://github.com/FYYDCC/IVT-LR).
---
### Download Models
You can download the models directly from Hugging Face using `huggingface_hub`:
```python
from huggingface_hub import hf_hub_download
# Example: download Qwen2-VL model
qwen_model_path = hf_hub_download("FYYDCC/IVTLR", "qwen_vl/model.pth")
# Example: download Chameleon model
chameleon_model_path = hf_hub_download("FYYDCC/IVTLR", "chameleon/model.pth")
```
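Once downloaded, the `.pth` files can be loaded with PyTorch. A minimal sketch, assuming each checkpoint stores a plain `state_dict` (the exact loading procedure may differ; refer to the GitHub repository):

```python
import torch

# Assumption: the checkpoint is a state_dict compatible with the
# corresponding base model (Qwen2-VL or Chameleon).
state_dict = torch.load(qwen_model_path, map_location="cpu")
```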