csfufu/Revisual-R1-Coldstart


🌟 ReVisual-R1 (7B) — Open-Source Multimodal Reasoner

One cold-start, two RL stages, endless reasoning power.

🔑 Highlights

State-of-the-art results on 9 challenging benchmarks spanning visual-math and text reasoning.
Three-Stage SRO Training:
1. Text Cold-Start: seeds deep reflection
2. Multimodal RL: aligns vision and logic
3. Text RL: polishes fluency and brevity
PAD (Prioritized Advantage Distillation) keeps training gradients from stalling.
Efficient-Length Reward encourages concise, self-reflective chain-of-thought (CoT).
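
To make the PAD highlight more concrete, here is a minimal sketch of one way prioritizing samples by advantage could work. It is an illustration under stated assumptions, not the authors' implementation: it assumes GRPO-style group-normalized advantages, and the function names (`group_advantages`, `prioritized_probs`) and the parameters `min_abs` and `temperature` are hypothetical. See the paper for the actual method.

```python
import math
from typing import Dict, List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's group of sampled responses.

    Assumption: GRPO-style normalization (subtract group mean, divide by std).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def prioritized_probs(
    advantages: List[float],
    min_abs: float = 1e-3,      # hypothetical threshold for "near-zero" advantage
    temperature: float = 1.0,   # hypothetical softmax temperature
) -> Dict[int, float]:
    """Drop near-zero-advantage samples, then softmax over |advantage| / T.

    Returns a mapping from sample index to sampling probability.
    An empty dict means the whole group carries no learning signal.
    """
    kept = [(i, abs(a)) for i, a in enumerate(advantages) if abs(a) > min_abs]
    if not kept:
        return {}
    top = max(score for _, score in kept)  # subtract max for numerical stability
    weights = [(i, math.exp((score - top) / temperature)) for i, score in kept]
    total = sum(w for _, w in weights)
    return {i: w / total for i, w in weights}


if __name__ == "__main__":
    # All responses scored the same: every advantage is zero, nothing to learn from.
    print(prioritized_probs(group_advantages([1.0, 1.0, 1.0, 1.0])))
    # One correct response among four: it receives the largest sampling probability.
    print(prioritized_probs(group_advantages([1.0, 0.0, 0.0, 0.0])))
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient. The sketch drops such samples and concentrates sampling on the ones with the largest absolute advantage.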

📚 Resources

Paper
Code

📌 Citation

@article{chen2025advancing,
  title={Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning},
  author={Chen, Shuang and Guo, Yue and Su, Zhaochen and Li, Yafu and Wu, Yulun and Chen, Jiacheng and Chen, Jiayu and Wang, Weijie and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2506.04207},
  year={2025}
}

Take ReVisual-R1 for a spin and let us know what you build! 🎯

Model size: 8.29B params (Safetensors)
Tensor type: BF16

Base model: Qwen/Qwen2.5-VL-7B-Instruct (this model is a fine-tune of it)
