---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
---

# IVT-LR

## Overview

This model was presented in the paper *Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space*.

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal reasoning entirely within it. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text and latent vision. We further introduce a progressive multi-stage training strategy that enables MLLMs to carry out these multimodal latent reasoning steps.
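
To picture what "interleaved latent reasoning" means, the sketch below shows one simplified view: instead of decoding text tokens at each step, the model appends a latent-text part (a hidden state) together with a latent-vision part (a small set of visual embeddings) back into its input sequence for the next step. The shapes and names here (`hidden_dim`, `append_latent_step`, `num_vision_latents`) are illustrative assumptions, not the released implementation; see the GitHub repository for the actual code.

```python
import torch

# Illustrative dimensions (assumptions, not the model's real sizes)
hidden_dim = 4096       # hidden size of the backbone MLLM
num_vision_latents = 8  # visual embeddings carried per reasoning step

def append_latent_step(sequence_embeds: torch.Tensor,
                       last_hidden: torch.Tensor,
                       vision_embeds: torch.Tensor) -> torch.Tensor:
    """Form the next-step input by appending one interleaved latent step.

    sequence_embeds: (1, seq_len, hidden_dim) current input embeddings
    last_hidden:     (1, hidden_dim) final hidden state, used as latent text
    vision_embeds:   (1, num_vision_latents, hidden_dim) selected visual
                     embeddings, used as latent vision
    """
    latent_text = last_hidden.unsqueeze(1)                 # (1, 1, hidden_dim)
    step = torch.cat([latent_text, vision_embeds], dim=1)  # one interleaved step
    return torch.cat([sequence_embeds, step], dim=1)       # grow the sequence

# Toy usage with random tensors standing in for real model outputs
seq = torch.randn(1, 32, hidden_dim)
hidden = torch.randn(1, hidden_dim)
vision = torch.randn(1, num_vision_latents, hidden_dim)
seq = append_latent_step(seq, hidden, vision)
print(seq.shape)  # torch.Size([1, 41, 4096])
```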


## Usage

This repository provides pretrained IVT-LR checkpoints for Qwen2-VL (trained on M3CoT) and Chameleon (trained on ScienceQA).

For detailed usage, including inference code and training scripts, please refer to the GitHub repository.


## Download Models

You can download the models directly from Hugging Face using the `huggingface_hub` library:

```python
from huggingface_hub import hf_hub_download

# Download the Qwen2-VL checkpoint (trained on M3CoT)
qwen_model_path = hf_hub_download("FYYDCC/IVTLR", "qwen_vl/model.pth")

# Download the Chameleon checkpoint (trained on ScienceQA)
chameleon_model_path = hf_hub_download("FYYDCC/IVTLR", "chameleon/model.pth")
```
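
You can also fetch the whole repository at once with `huggingface_hub.snapshot_download("FYYDCC/IVTLR")`. The downloaded `model.pth` files are assumed here to be standard PyTorch checkpoints, so a first sanity check can be done with `torch.load`; how the weights are attached to the Qwen2-VL or Chameleon backbone is defined in the GitHub repository, and the snippet below is only a minimal sketch of that first step.

```python
import torch

# Assumption: model.pth is a regular PyTorch checkpoint (a state dict or a
# wrapper dict). The actual loading logic lives in the GitHub repository.
checkpoint = torch.load(qwen_model_path, map_location="cpu")
print(type(checkpoint))  # inspect what the checkpoint contains before loading
```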