Paper Index

Section under construction. Feel free to contribute!

Group Sequence Policy Optimization

📜 Paper: https://huggingface.co/papers/2507.18071

GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token. To reproduce the paper’s setting, use this configuration:

from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    beta=0.0,  # GSPO set KL regularization to zero: https://github.com/volcengine/verl/pull/2775#issuecomment-3131807306 
    epsilon=3e-4,  # GSPO paper (v2), section 5.1
    epsilon_high=4e-4,  # GSPO paper (v2), section 5.1
    gradient_accumulation_steps=1,
    steps_per_generation=4,  # partition rollout batch into 4 mini-batches. GSPO paper (v2), section 5.1. Must be 4 times gradient_accumulation_steps
)

Note that this method only has an effect when training goes slightly off-policy—for example, when steps_per_generation > gradient_accumulation_steps or num_iterations > 1. Otherwise, it is effectively equivalent to no modification.

Policy ratio: GRPO vs. GSPO

In GSPO, the policy ratio is defined at the sequence-level. In other words, it is the ratio between the probability of the current policy generating a sequence over the old policy generating that same sequence.

The sequence likelihood is defined as: $\pi_\theta (o_i \mid q) = \prod_{t=1}^{|o_i|} \pi_\theta (o_{i,t} | q, o_{i, \lt t} ),$

where $\pi_\theta$ is the policy $\pi$ with parameters $\theta$ , $o_i$ is the $i$ -th output sequence $o$ and $y_{i,t}$ is the $t$ -th token in this sequence, $q$ is the input query. The sequence likelihood ratio $s_i (\theta)$ is defined as: $s_i (\theta) = \left(\frac{\pi_\theta (o_i | q)}{\pi_{\theta_{old}} (o_i | q)} \right)^{\frac{1}{|o_i|}}$

The exponent $\frac{1}{|y_i|}$ represents a sequence-length normalization, minimizing the influence of sequence lenght in sequence likelihood. In other terms, it computes the geometric mean of token probabilities, ensuring a fair comparison across sequences of varying lengths.

While GSPO defines the policy ratio at the sequence level, GRPO operates at the token level. Specifically, GRPO computes an importance ratio for each token in the sequence: $w_{i,t}(\theta) = \frac{\pi_\theta (o_{i,t} \mid q, o_{i,\lt t})}{\pi_{\theta_{\text{old}}} (o_{i,t} \mid q, o_{i,\lt t})}$

This token-level ratio is then combined with a shared advantage $\hat{A}_i$ , and the GRPO objective clips and optimizes each token independently across the sequence.

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

📜 Paper: https://huggingface.co/papers/2503.14476

The DAPO algorithm includes 5 key components:

Overlong Filtering
Clip-Higher
Soft Overlong Punishment
Token-level Loss
Dynamic Sampling (⚠️ Not supported in TRL)

To reproduce the paper’s setting, use this configuration:

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    # Overlong Filtering
    mask_truncated_completions=True,
    # Token-level Loss
    loss_type="dapo",
    # Clip-Higher
    epsilon_high=0.28, # DAPO paper: section 4.1
    epsilon=0.2, # DAPO paper: section 4.1
    # Other parameters used
    per_device_train_batch_size=512, # mini-batch size for training in the paper, DAPO paper: section 4.1
    num_generations=16, # number of sample responses in the paper, DAPO paper: section 4.1
    max_completion_length=20480, #  maximum number of tokens for generation in the paper, DAPO paper: section 4.1
    beta=0.0 # section 2.3, DAPO paper

)
# Soft Overlong Punishment
sop_reward = get_soft_overlong_punishment(max_completion_len=20480, soft_punish_cache=4096) # DAPO paper: section 4.1
trainer = GRPOTrainer(
    ...,
    args=training_args,
    reward_funcs=[..., sop_reward],
)

Dr. GRPO: Understanding R1-Zero-Like Training: A Critical Perspective

📜 Paper: https://huggingface.co/papers/2503.20783

A study of R1-Zero training identifies pretraining effects on RL performance and proffers Dr. GRPO to enhance token efficiency, achieving superior accuracy on AIME 2024. To reproduce the paper’s setting, use this configuration:

from trl import GRPOConfig

training_args = GRPOConfig(
    loss_type="dr_grpo",
    per_device_train_batch_size=1, # train_batch_size_per_device in the Training section of the repository
    num_generations=8, #  num_samples in the Training section of the repository
    max_prompt_length=1024, #  prompt_max_length in the Training section of the repository
    max_completion_length=3000, # generate_max_length in the Training section of the repository
    beta=0.0, # beta in the Training section of the repository
)

Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model

📜 Paper: https://huggingface.co/papers/2305.18290

Direct Preference Optimization (DPO) fine-tunes language models more efficiently and with better performance compared to reinforcement learning from human feedback (RLHF), by directly optimizing policy training based on human preferences. To reproduce the paper’s setting, use this configuration:

from trl import DPOConfig

training_args = DPOConfig(
    loss_type="sigmoid", # losses in Appendix B of the paper
    per_device_train_batch_size=64, #  batch size in Appendix B of the paper
    learning_rate=1e-6, # learning rate in Appendix B of the paper
    beta=0.1, # beta in Appendix B of the paper
)

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

📜 Paper: https://huggingface.co/papers/2402.14740

RLOO is a variant of REINFORCE that reduces variance by using leave-one-out baselines. It computes rewards by comparing each sample against the average of all other samples in the batch, providing more stable gradients than standard REINFORCE. To reproduce the paper’s setting, use this configuration:

from trl import RLOOConfig

training_args = RLOOConfig(
    per_device_train_batch_size=512,  # section C Training Detail of the paper
    steps_per_generation=2  # section C Training Detail of the paper
    beta=0.03  # section C Training Detail of the paper
    num_generations=2,  # experiments of paper different num_generations={2,4}
    learning_rate=1e-6  # section C Training Detail of the paper
)

AlphaPO — Reward shape matters for LLM alignment

📜 Paper: https://huggingface.co/papers/2501.03884

AlphaPO is a new Direct Alignment Algorithms (DAAs) method that leverages an alpha-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. To reproduce the paper’s setting, use this configuration:

from trl import CPOConfig

# Mistral-Instruct from Table 3 of the paper
training_args = CPOConfig(
    loss_type="alphapo",
    alpha=0.25,
    beta=2.5,
    simpo_gamma=0.1,
    learning_rate=7e-7,
    ...
)

EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes

📜 Paper: https://huggingface.co/papers/2508.00180

Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can use the BEMACallback:

from trl import BEMACallback, SFTTrainer

trainer = SFTTrainer(
    ...
    callbacks=[BEMACallback()],
)

Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)

📜 Paper: https://huggingface.co/papers/2508.08221

The authors of this paper find that the combination of:

scaling rewards by the standard deviation computed over the entire batch and
aggregating loss over the total number of tokens

can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

TRL supports using these learnings to train a GRPO model by:

from trl import GRPOConfig

training_args = GRPOConfig(
    ...
    scale_rewards="batch",
    loss_type="dapo",
    # Other parameters used
    beta=0.0,  # = init_kl_coef in the paper
    top_p=0.99,
    top_k=100,
    temperature=0.99,
    num_completions=8, # = num_return_sequences in the paper
    num_iterations=1,  # = ppo_epochs in the paper
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,
    steps_per_generation=8,  # (rollout_batch_size*num_return_sequences) / (per_device_train_batch_size*gradient_accumulation_steps)
)

Note that when using gradient accumulation, the loss is aggregated over the total number of tokens in the batch, but not over the accumulated batch. For more details, see the GRPO Trainer - Loss types.

< > Update on GitHub