arxiv:2510.27606

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Published on Oct 31 · Submitted by Yuhang Zang on Nov 3

Abstract

AI-generated summary: Spatial-SSRL, a self-supervised reinforcement learning paradigm, enhances spatial understanding in Large Vision-Language Models using verifiable signals from RGB or RGB-D images without human annotation.

Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
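As an illustration of how a verifiable signal can be read straight off RGB-D data, here is a minimal sketch (not the paper's exact recipe) of a regional depth-ordering item: two regions are compared by median depth, and the closer one becomes the ground-truth answer. The region format, ambiguity threshold, and prompt wording are assumptions made for this sketch.

```python
import numpy as np

def make_depth_ordering_sample(depth, box_a, box_b, min_gap=0.05):
    """Build a regional depth-ordering question from a depth map that is
    aligned with the RGB image. The answer comes from the depth values
    themselves, so no human or LVLM annotation is involved.

    depth: 2D array of depth values (smaller = closer to the camera).
    box_a, box_b: regions given as (top, left, bottom, right) pixels
                  (a hypothetical format chosen for this sketch).
    """
    (ta, la, ba, ra), (tb, lb, bb, rb) = box_a, box_b
    depth_a = float(np.median(depth[ta:ba, la:ra]))
    depth_b = float(np.median(depth[tb:bb, lb:rb]))

    # Skip ambiguous pairs whose median depths are nearly identical.
    if abs(depth_a - depth_b) < min_gap:
        return None

    answer = "A" if depth_a < depth_b else "B"
    question = ("Regions A and B are marked in the image. "
                "Which region is closer to the camera? Answer A or B.")
    return question, answer
```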

Community

πŸ›°οΈ Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

📘 Paper: arXiv:2510.27606
💻 Code: github.com/InternLM/Spatial-SSRL

🧩 Abstract

Spatial understanding remains a key weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) pipelines rely on costly human supervision, specialized tools, or closed environments that hinder scalability.

We propose Spatial-SSRL, a self-supervised reinforcement learning framework that derives verifiable spatial signals directly from ordinary RGB/RGB-D images, with no annotations required.
Spatial-SSRL formulates five intrinsic pretext tasks capturing 2D and 3D spatial structure:

  1. 🧩 Shuffled Patch Reordering
  2. 🔄 Flipped Patch Recognition
  3. 🖼️ Cropped Patch Inpainting
  4. 🌗 Regional Depth Ordering
  5. 📏 Relative 3D Position Prediction

Each task produces verifiable ground-truth feedback, enabling RLVR training without human or LVLM labels.
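To make "verifiable without labels" concrete, the sketch below builds one such item, shuffled patch reordering, together with the binary exact-match reward an RLVR loop could check against it. The grid size, prompt wording, and answer format are illustrative assumptions rather than the paper's exact construction.

```python
import random
from PIL import Image

def make_patch_reordering_sample(image_path, grid=2, seed=0):
    """Cut an image into a grid of patches, shuffle them, and keep the
    permutation as the verifiable ground-truth answer (no annotation)."""
    rng = random.Random(seed)
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    pw, ph = w // grid, h // grid

    # Patches listed in reading (row-major) order.
    patches = [img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
               for r in range(grid) for c in range(grid)]

    # Display slot `slot` shows original patch `order[slot]`.
    order = list(range(len(patches)))
    rng.shuffle(order)
    shuffled = Image.new("RGB", (pw * grid, ph * grid))
    for slot, src in enumerate(order):
        r, c = divmod(slot, grid)
        shuffled.paste(patches[src], (c * pw, r * ph))

    question = (f"The image is a {grid}x{grid} grid of shuffled patches. "
                "Give the original index of each patch in reading order.")
    answer = order  # e.g. [2, 0, 3, 1]
    return shuffled, question, answer

def exact_match_reward(predicted, answer):
    """Binary verifiable reward: 1.0 only if the prediction matches exactly."""
    return 1.0 if predicted == answer else 0.0
```

Because the ground truth is produced by the corruption itself, checking an answer is a trivial comparison, which is what lets this style of RLVR scale without annotators.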

🚀 Key Results

  • Improves spatial reasoning while preserving general visual capabilities
  • Evaluated on 7 spatial understanding benchmarks (image + video)
  • Achieves +4.63% (3B) and +3.89% (7B) accuracy gains over Qwen2.5-VL baselines

🧠 Takeaway

Spatial-SSRL shows that simple, intrinsic supervision can scale RLVR efficiently, paving the way toward stronger spatial intelligence in LVLMs.
