arxiv:2510.27606

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Published on Oct 31 · Submitted by Yuhang Zang on Nov 3

Abstract

AI-generated summary: Spatial-SSRL, a self-supervised reinforcement learning paradigm, enhances spatial understanding in Large Vision-Language Models using verifiable signals from RGB or RGB-D images without human annotation.

Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
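As an illustration of how a verifiable signal can be read straight off RGB-D data, here is a minimal sketch (not the paper's exact recipe) of a regional depth-ordering item: two regions are compared by median depth, and the closer one becomes the ground-truth answer. The region format, ambiguity threshold, and prompt wording are assumptions made for this sketch.

```python
import numpy as np

def make_depth_ordering_sample(depth, box_a, box_b, min_gap=0.05):
    """Build a regional depth-ordering question from a depth map that is
    aligned with the RGB image. The answer comes from the depth values
    themselves, so no human or LVLM annotation is involved.

    depth: 2D array of depth values (smaller = closer to the camera).
    box_a, box_b: regions given as (top, left, bottom, right) pixels
                  (a hypothetical format chosen for this sketch).
    """
    (ta, la, ba, ra), (tb, lb, bb, rb) = box_a, box_b
    depth_a = float(np.median(depth[ta:ba, la:ra]))
    depth_b = float(np.median(depth[tb:bb, lb:rb]))

    # Skip ambiguous pairs whose median depths are nearly identical.
    if abs(depth_a - depth_b) < min_gap:
        return None

    answer = "A" if depth_a < depth_b else "B"
    question = ("Regions A and B are marked in the image. "
                "Which region is closer to the camera? Answer A or B.")
    return question, answer
```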

Community

πŸ›°οΈ Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

📘 Paper: arXiv:2510.27606
💻 Code: github.com/InternLM/Spatial-SSRL

🧩 Abstract

Spatial understanding remains a key weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) pipelines rely on costly human supervision, specialized tools, or closed environments that hinder scalability.

We propose Spatial-SSRL, a self-supervised reinforcement learning framework that derives verifiable spatial signals directly from ordinary RGB/RGB-D images, with no annotations required.
Spatial-SSRL formulates five intrinsic pretext tasks capturing 2D and 3D spatial structure:

  1. 🧩 Shuffled Patch Reordering
  2. 🔄 Flipped Patch Recognition
  3. 🖼️ Cropped Patch Inpainting
  4. 🌗 Regional Depth Ordering
  5. 📏 Relative 3D Position Prediction

Each task produces verifiable ground-truth feedback, enabling RLVR training without human or LVLM labels.
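To make "verifiable without labels" concrete, the sketch below builds one such item, shuffled patch reordering, together with the binary exact-match reward an RLVR loop could check against it. The grid size, prompt wording, and answer format are illustrative assumptions rather than the paper's exact construction.

```python
import random
from PIL import Image

def make_patch_reordering_sample(image_path, grid=2, seed=0):
    """Cut an image into a grid of patches, shuffle them, and keep the
    permutation as the verifiable ground-truth answer (no annotation)."""
    rng = random.Random(seed)
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    pw, ph = w // grid, h // grid

    # Patches listed in reading (row-major) order.
    patches = [img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
               for r in range(grid) for c in range(grid)]

    # Display slot `slot` shows original patch `order[slot]`.
    order = list(range(len(patches)))
    rng.shuffle(order)
    shuffled = Image.new("RGB", (pw * grid, ph * grid))
    for slot, src in enumerate(order):
        r, c = divmod(slot, grid)
        shuffled.paste(patches[src], (c * pw, r * ph))

    question = (f"The image is a {grid}x{grid} grid of shuffled patches. "
                "Give the original index of each patch in reading order.")
    answer = order  # e.g. [2, 0, 3, 1]
    return shuffled, question, answer

def exact_match_reward(predicted, answer):
    """Binary verifiable reward: 1.0 only if the prediction matches exactly."""
    return 1.0 if predicted == answer else 0.0
```

Because the ground truth is produced by the corruption itself, checking an answer is a trivial comparison, which is what lets this style of RLVR scale without annotators.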

🚀 Key Results

  • Improves spatial reasoning while preserving general visual capabilities
  • Evaluated on 7 spatial understanding benchmarks (image + video)
  • Achieves +4.63% (3B) and +3.89% (7B) accuracy gains over Qwen2.5-VL baselines

🧠 Takeaway

Spatial-SSRL shows that simple, intrinsic supervision can scale RLVR efficiently, paving the way toward stronger spatial intelligence in LVLMs.
