RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization
Abstract
RLoop, a self-improving framework using iterative policy initialization and Rejection-sampling Fine-Tuning, mitigates overfitting and enhances generalization in Reinforcement Learning for Verifiable Rewards.
While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
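To make the loop concrete, below is a minimal Python sketch of the iteration described in the abstract: RL exploration, rejection-sampling of verified trajectories, RFT on the pooled expert data, and re-initialization of the next RL phase. All names (`run_rl_phase`, `sample_solutions`, `verify`, `rft_finetune`, `num_iterations`) are hypothetical placeholders, not the authors' implementation or hyperparameters.

```python
# Minimal sketch of the RLoop outer loop (illustrative, not the paper's code).
from typing import Callable, List, Tuple

Policy = object    # stand-in for a policy / model checkpoint
Problem = str      # a reasoning problem (e.g., a math question)
Solution = str     # a sampled solution trace with a final answer


def rloop(
    init_policy: Policy,
    problems: List[Problem],
    run_rl_phase: Callable[[Policy, List[Problem]], List[Policy]],   # RL exploration; returns intermediate checkpoints
    sample_solutions: Callable[[Policy, Problem], List[Solution]],   # rollout sampler for a checkpoint
    verify: Callable[[Problem, Solution], bool],                     # verifiable reward (e.g., answer checker)
    rft_finetune: Callable[[Policy, List[Tuple[Problem, Solution]]], Policy],  # rejection-sampling fine-tuning
    num_iterations: int = 3,
) -> Policy:
    """Iterative policy initialization: explore -> filter -> RFT -> re-initialize."""
    policy = init_policy
    for _ in range(num_iterations):
        # 1) Explore: run RL from the current initialization, keeping intermediate checkpoints.
        checkpoints = run_rl_phase(policy, problems)

        # 2) Filter: pool rollouts across checkpoints and keep only verified-successful
        #    trajectories, so transient inter-step policy diversity is not discarded.
        expert_data: List[Tuple[Problem, Solution]] = []
        for ckpt in checkpoints:
            for prob in problems:
                for sol in sample_solutions(ckpt, prob):
                    if verify(prob, sol):
                        expert_data.append((prob, sol))

        # 3) Exploit: refine the policy that initialized this RL phase via RFT,
        #    producing the starting point for the next iteration (one reading of
        #    "refine the initial policy" in the abstract).
        policy = rft_finetune(policy, expert_data)
    return policy
```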
Community
Introducing RLoop, our new framework to fix overfitting in Reinforcement Learning!
In RL, models often hit high training rewards but fail to generalize. We traced the cause: policy over-specialization and "catastrophic forgetting" discard the diverse, valuable solutions discovered during training.
RLoop solves this by turning the entire RL training process into a self-improvement loop:
- Explore: Run RL and collect successful solutions from all intermediate checkpoints (see the sketch below).
- Exploit & Re-initialize: Use this "expert data" to refine the starting policy for the next RL run.
By iteratively exploring and exploiting, RLoop converts fleeting discoveries into robust, generalizable skills. On math reasoning benchmarks, it delivered a +9% boost in average accuracy and over +15% in pass@32.
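Here is a minimal sketch of the collect-and-filter (rejection sampling) step from "Explore" above, assuming each rollout carries a final answer that can be checked against a reference. The `Rollout` record and the exact-match verifier are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch: pool rollouts from all intermediate RL checkpoints and keep
# only verified-correct ones as RFT training data (illustrative only).
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rollout:
    problem_id: str
    solution_text: str
    final_answer: str
    checkpoint: str  # which intermediate checkpoint produced this rollout


def build_rft_dataset(
    rollouts_per_checkpoint: Dict[str, List[Rollout]],
    gold_answers: Dict[str, str],
) -> List[Rollout]:
    """Rejection sampling: keep rollouts whose final answer matches the reference."""
    expert_data: List[Rollout] = []
    seen = set()  # de-duplicate identical (problem, solution) pairs across checkpoints
    for ckpt, rollouts in rollouts_per_checkpoint.items():
        for r in rollouts:
            if r.final_answer.strip() != gold_answers[r.problem_id].strip():
                continue  # reject: failed the verifiable reward
            key = (r.problem_id, r.solution_text)
            if key in seen:
                continue
            seen.add(key)
            expert_data.append(r)
    return expert_data
```

Pooling across checkpoints, rather than keeping only the final policy's rollouts, is what turns transient policy variation during RL into reusable expert data.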
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient (2025)
- More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration (2025)
- HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness (2025)
- Think Outside the Policy: In-Context Steered Policy Optimization (2025)
- EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget (2025)
- Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners (2025)
- SimKO: Simple Pass@K Policy Optimization (2025)