---
library_name: transformers
license: mit
pipeline_tag: text-generation
---

# HAPO: From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature

This repository hosts a modified Qwen2.5-Math-1.5B model, serving as the base for the Heterogeneous Adaptive Policy Optimization (HAPO) framework. This model was presented in the paper [From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature](https://huggingface.co/papers/2509.16591).

We change the `rope_theta` from 10000 to 40000 and extend the context window to 16k. We also modify the `chat_template` for the system prompt and add ``.

For the associated code and more details, please visit the [official GitHub repository](https://github.com/starriver030515/HAPO).

## About HAPO

Reinforcement learning has emerged as a fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. To address this limitation, we introduce **H**eterogeneous **A**daptive **P**olicy **O**ptimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose **Adaptive Temperature Sampling**, which adjusts the sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce **Token Level Group Average**, which normalizes advantages at the token level, accounting for sequence length as in token-mean loss while preserving unbiased treatment. We then develop **Differential Advantage Redistribution**, which leverages entropy and importance ratios to modulate rewards, adjusting updates for tokens with clear signals. For the clipping loss, we design **Asymmetric Adaptive Clipping**, allowing aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through a systematic investigation of the relationship between entropy and training dynamics, we embed token-level treatment into every stage to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales.
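To make the token-level idea concrete, below is a minimal, illustrative PyTorch sketch of entropy-adaptive temperature sampling. It is not HAPO's actual implementation (which lives in the patched vllm files under `HAPO/vllm`); the linear modulation rule and the `alpha` knob are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def adaptive_temperature_sample(logits: torch.Tensor,
                                base_temp: float = 1.0,
                                alpha: float = 0.5) -> torch.Tensor:
    """Sample one token with an entropy-adjusted temperature (illustrative only).

    `logits` is the next-token logit vector of shape (vocab_size,).
    High-entropy positions get a higher temperature (more exploration),
    low-entropy positions a lower one (preserving coherence). `alpha` is a
    hypothetical strength knob, not a parameter taken from the paper.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(logits.numel())))  # entropy of a uniform distribution
    norm_entropy = (entropy / max_entropy).clamp(0.0, 1.0)        # in [0, 1]
    temperature = base_temp * (1.0 + alpha * (norm_entropy - 0.5))
    return torch.multinomial(F.softmax(logits / temperature, dim=-1), num_samples=1)
```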
Figure: HAPO Framework
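As a quick check of the modified base model described above, you can load it with `transformers` and inspect the updated config. The values noted in the comments follow the description above; verify them against the released config.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "starriver030515/Qwen2.5-Math-1.5B-16k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# The config should reflect the modifications described above.
print(model.config.rope_theta)               # expected: 40000
print(model.config.max_position_embeddings)  # extended to a 16k context window
```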
## Installation

1. Clone this repository and navigate to the folder:

   ```bash
   git clone https://github.com/starriver030515/HAPO
   cd HAPO
   ```

2. Create a conda environment, activate it, and install packages:

   ```Shell
   conda create -n hapo python=3.10 -y
   conda activate hapo
   ```

3. Execute the verl installation script to install dependencies:

   ```bash
   bash scripts/install_vllm_sglang_mcore.sh
   pip install -e .
   ```

## Usage

### Preparation

First, download the training and evaluation parquet files from [hapo_data](https://huggingface.co/datasets/starriver030515/hapo_data).

If you use Qwen2.5-Math for training, please download [Qwen2.5-Math-1.5B-16k](https://huggingface.co/starriver030515/Qwen2.5-Math-1.5B-16k) and [Qwen2.5-Math-7B-32k](https://huggingface.co/starriver030515/Qwen2.5-Math-7B-32k), for which we modified the maximum position length to support longer-context training. For other models, you can download them from their official repositories.

To support Adaptive Temperature Sampling, you need to replace the vllm-related files in your environment with those from HAPO/vllm.

### Train

Our training scripts are located in the [recipe](https://github.com/starriver030515/HAPO/tree/main/recipe) folder. You only need to replace `MODEL_PATH`, `TRAIN_FILE`, and `TEST_FILE`. Detailed parameter explanations are available in [train.md](https://github.com/starriver030515/HAPO/blob/main/recipe/train.md).

```bash
cd recipe
bash qwen2.5_math_7b.sh
```

### Evaluation

```bash
cd scripts
bash eval_model.sh
```

## Results

Comparison between vanilla DAPO using all tokens, DAPO with forking tokens, Archer, EDGE-GRPO, and HAPO, evaluated on the Qwen-Math-1.5B Base, Qwen-Math-7B Base, and Qwen3-8B Base models. For each question, we generate 8 independent responses with a decoding temperature of $T=0.5$ and report the average accuracy.
Figure: HAPO Overall Results
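The 8-sample, $T=0.5$ protocol described above can be approximated with a short standalone script. This is a rough sketch using the standard `transformers` generation API rather than the repo's `eval_model.sh` pipeline; `is_correct` is only a placeholder for a proper math answer verifier, and the checkpoint name is just an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "starriver030515/Qwen2.5-Math-1.5B-16k"  # example; substitute a trained HAPO checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def is_correct(completion: str, reference: str) -> bool:
    # Placeholder check; the actual evaluation uses a proper math answer verifier.
    return reference.strip() in completion

def avg_accuracy_at_8(question: str, reference: str) -> float:
    """Generate 8 samples at T=0.5 and return the fraction judged correct."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.5,
            num_return_sequences=8,
            max_new_tokens=2048,
        )
    prompt_len = inputs["input_ids"].shape[1]
    completions = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
    return sum(is_correct(c, reference) for c in completions) / len(completions)
```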
## Training Dynamics

The figures below compare the training dynamics of DAPO and HAPO with respect to four key metrics:

- **AIME24 and AIME25 Results**: HAPO consistently achieves higher accuracy across all model sizes (Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Qwen3-8B), demonstrating superior learning efficiency and performance throughout the training process.
- **Response Length**: HAPO maintains longer response lengths during training compared to DAPO, indicating more comprehensive and detailed solution generation without compromising quality.
- **Mean Entropy**: HAPO preserves significantly higher entropy throughout training across all model configurations, demonstrating better exploration and maintaining response diversity, which prevents premature convergence to suboptimal solutions (a minimal sketch of this entropy computation follows the figures below).
Figure 1: AIME24 accuracy comparison - HAPO consistently achieves higher accuracy across all model sizes
Figure 2: AIME25 accuracy comparison - HAPO consistently achieves higher accuracy across all model sizes
Figure 3: Response length over training steps - HAPO maintains longer, more comprehensive responses
Figure 4: Mean entropy comparison - HAPO preserves higher entropy, indicating better exploration and diversity
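For reference, the mean entropy tracked in Figure 4 is the average per-token entropy of the policy's next-token distribution over the generated positions. A minimal sketch of that computation, assuming you already have the per-position logits as a tensor:

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy over a generated sequence.

    `logits` has shape (seq_len, vocab_size): the policy's next-token logits
    at each generated position. Entropy at a position is H = -sum_v p_v * log p_v;
    the reported metric is the average of H over positions.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy_per_token = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    return entropy_per_token.mean()
```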
## Citation

If you find our work interesting and helpful, please consider giving our repo a star. Additionally, if you would like to cite our work, please use the following format:

```bibtex
@misc{liu2025hapo,
  title={From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature},
  author={Zheng Liu and Mengjie Liu and Siwei Wen and Mengzhang Cai and Bin Cui and Conghui He and Lijun Wu and Wentao Zhang},
  year={2025},
  eprint={2509.16591},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.16591},
}
```

## Contact

If you have any questions or suggestions, please feel free to contact us at `2501213330@stu.pku.edu.cn`.

## Community efforts

* This repository is based on the [verl](https://github.com/volcengine/verl/tree/main) project.