ROVER-Qwen3-4B: Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards


This repository hosts the ROVER-Qwen3-4B model, a large language model fine-tuned for math reasoning with verifiable rewards, as presented in the paper Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards.

ROVER (Random Policy Valuation for Diverse Reasoning) is a minimalist yet highly effective reinforcement learning (RL) method for LLM reasoning. By evaluating the Q-values of a fixed uniformly random policy, it recovers high-quality actions while bypassing complex policy optimization loops, leading to more stable training and more diverse generated reasoning paths.

Abstract: RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration that alternates between evaluating the current policy's value and improving the policy based on evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce Random Policy Valuation for Diverse Reasoning (ROVER) to translate this principle into a practical and scalable algorithm for LLM math reasoning, a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%), despite its radical simplification compared to strong, complicated existing methods.
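
The core idea can be summarized in a few lines: rather than alternating policy evaluation and improvement, ROVER estimates the Q-values of a fixed uniformly random policy and samples the next token from a softmax over them. The snippet below is a minimal sketch of this sampling rule, not the paper's implementation; the Q-value estimates and the temperature `tau` are illustrative placeholders.

```python
import numpy as np

def rover_sample(q_values: np.ndarray, tau: float = 1.0, rng=None) -> int:
    """Sample an action from a softmax over uniform-policy Q-values.

    q_values: estimates of Q^{pi_uniform}(s, a) for each candidate token a,
              i.e. the value of taking a and then following the uniformly
              random policy until the binary terminal reward.
    tau:      softmax temperature; tau -> 0 recovers the greedy argmax,
              which the paper proves recovers an optimal action.
    """
    rng = rng or np.random.default_rng()
    logits = q_values / tau
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

# Example: three candidate tokens with (hypothetical) uniform-policy Q-values.
q = np.array([0.10, 0.55, 0.35])
next_token = rover_sample(q, tau=0.5)
```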

πŸ† Main Results and Features

Figure 1: (a) ROVER achieves superior performance on both pass@1 and pass@256 (trained on Qwen3-8B-Base, averaged over the AIME24, AIME25, and HMMT25 tasks). (b) Illustrative example showing that ROVER reaches high-quality solutions with a lightweight procedure (see the table below for details) while maintaining diversity. (c) ROVER achieves higher diversity.

ROVER requires minimal GPU memory and compute for model parameters, leaving more room for the KV cache. This allows ROVER to run on smaller-memory setups and speeds up training (a rough estimate is sketched after the table):

| Method | Memory Usage of Model Parameters |
| --- | --- |
| ROVER (Ours) | Low (actor model only! 😊) |
| GRPO | Medium (actor + reference model) |
| PPO | High (actor + reference + critic model) |
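
As a rough back-of-the-envelope illustration of the table (assuming BF16 weights at 2 bytes per parameter, and ignoring optimizer states, gradients, and activations, so real usage is higher), parameter memory scales with the number of model copies each method keeps resident:

```python
# Rough parameter-memory estimate; optimizer states, gradients,
# and activations are ignored, so actual usage is higher.
BYTES_PER_PARAM = 2       # BF16
N_PARAMS = 4e9            # e.g. a 4B-parameter model like ROVER-Qwen3-4B

copies = {
    "ROVER (actor only)": 1,
    "GRPO (actor + reference)": 2,
    "PPO (actor + reference + critic)": 3,
}

for method, n in copies.items():
    gib = n * N_PARAMS * BYTES_PER_PARAM / 2**30
    print(f"{method}: ~{gib:.1f} GiB of parameters")
```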

🤗 Models

| Model | Task |
| --- | --- |
| 🤗 ROVER-Qwen3-4B | Math Reasoning |
| 🤗 ROVER-Qwen3-8B | Math Reasoning |
| 🤗 ROVER-countdown-3B | Countdown Games |
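
A minimal example of loading the 4B checkpoint with 🤗 Transformers is sketched below; the prompt and generation settings are illustrative, not the paper's evaluation setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "haoranhe/ROVER-Qwen3-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

# Illustrative math prompt; adjust decoding parameters to your use case.
prompt = "Solve step by step: what is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```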

For more details on installation, training, evaluation, and the full codebase, please refer to the official GitHub repository.

📖 Citation

If you find the project useful, please consider citing our paper:

```bibtex
@article{he2025randompolicyvaluation,
      title={Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards},
      author={Haoran He and Yuxiao Ye and Qingpeng Cai and Chen Hu and Binxing Jiao and Daxin Jiang and Ling Pan},
      journal={arXiv preprint arXiv:2509.24981},
      year={2025}
}
```
ROVER-Qwen3-4B has 4B parameters, stored as BF16 Safetensors, and is fine-tuned from the base model Qwen/Qwen3-4B-Base.