RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization
Abstract
RLoop, a self-improving framework using iterative policy initialization and Rejection-sampling Fine-Tuning, mitigates overfitting and enhances generalization in Reinforcement Learning for Verifiable Rewards.
While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
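To make the loop concrete, below is a minimal Python sketch of the iteration described in the abstract: RL exploration, rejection-sampling of verified trajectories, RFT on the pooled expert data, and re-initialization of the next RL phase. All names (`run_rl_phase`, `sample_solutions`, `verify`, `rft_finetune`, `num_iterations`) are hypothetical placeholders, not the authors' implementation or hyperparameters.

```python
# Minimal sketch of the RLoop outer loop (illustrative, not the paper's code).
from typing import Callable, List, Tuple

Policy = object    # stand-in for a policy / model checkpoint
Problem = str      # a reasoning problem (e.g., a math question)
Solution = str     # a sampled solution trace with a final answer


def rloop(
    init_policy: Policy,
    problems: List[Problem],
    run_rl_phase: Callable[[Policy, List[Problem]], List[Policy]],   # RL exploration; returns intermediate checkpoints
    sample_solutions: Callable[[Policy, Problem], List[Solution]],   # rollout sampler for a checkpoint
    verify: Callable[[Problem, Solution], bool],                     # verifiable reward (e.g., answer checker)
    rft_finetune: Callable[[Policy, List[Tuple[Problem, Solution]]], Policy],  # rejection-sampling fine-tuning
    num_iterations: int = 3,
) -> Policy:
    """Iterative policy initialization: explore -> filter -> RFT -> re-initialize."""
    policy = init_policy
    for _ in range(num_iterations):
        # 1) Explore: run RL from the current initialization, keeping intermediate checkpoints.
        checkpoints = run_rl_phase(policy, problems)

        # 2) Filter: pool rollouts across checkpoints and keep only verified-successful
        #    trajectories, so transient inter-step policy diversity is not discarded.
        expert_data: List[Tuple[Problem, Solution]] = []
        for ckpt in checkpoints:
            for prob in problems:
                for sol in sample_solutions(ckpt, prob):
                    if verify(prob, sol):
                        expert_data.append((prob, sol))

        # 3) Exploit: refine the policy that initialized this RL phase via RFT,
        #    producing the starting point for the next iteration (one reading of
        #    "refine the initial policy" in the abstract).
        policy = rft_finetune(policy, expert_data)
    return policy
```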
Community
Introducing RLoop, our new framework to fix overfitting in Reinforcement Learning!
In RL, models often hit high training rewards but fail to generalize. We traced the cause: policy over-specialization and "catastrophic forgetting" discard the diverse, valuable solutions discovered during training.
RLoop solves this by turning the entire RL training process into a self-improvement loop:
- Explore: Run RL and collect successful solutions from all intermediate checkpoints (see the sketch below).
- Exploit & Re-initialize: Use this "expert data" to refine the starting policy for the next RL run.
By iteratively exploring and exploiting, RLoop converts fleeting discoveries into robust, generalizable skills. On math reasoning benchmarks, it delivered a +9% boost in average accuracy and over +15% in pass@32.
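Here is a minimal sketch of the collect-and-filter (rejection sampling) step from "Explore" above, assuming each rollout carries a final answer that can be checked against a reference. The `Rollout` record and the exact-match verifier are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch: pool rollouts from all intermediate RL checkpoints and keep
# only verified-correct ones as RFT training data (illustrative only).
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rollout:
    problem_id: str
    solution_text: str
    final_answer: str
    checkpoint: str  # which intermediate checkpoint produced this rollout


def build_rft_dataset(
    rollouts_per_checkpoint: Dict[str, List[Rollout]],
    gold_answers: Dict[str, str],
) -> List[Rollout]:
    """Rejection sampling: keep rollouts whose final answer matches the reference."""
    expert_data: List[Rollout] = []
    seen = set()  # de-duplicate identical (problem, solution) pairs across checkpoints
    for ckpt, rollouts in rollouts_per_checkpoint.items():
        for r in rollouts:
            if r.final_answer.strip() != gold_answers[r.problem_id].strip():
                continue  # reject: failed the verifiable reward
            key = (r.problem_id, r.solution_text)
            if key in seen:
                continue
            seen.add(key)
            expert_data.append(r)
    return expert_data
```

Pooling across checkpoints, rather than keeping only the final policy's rollouts, is what turns transient policy variation during RL into reusable expert data.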
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient (2025)
- More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration (2025)
- HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness (2025)
- Think Outside the Policy: In-Context Steered Policy Optimization (2025)
- EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget (2025)
- Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners (2025)
- SimKO: Simple Pass@K Policy Optimization (2025)