Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Paper • 2605.12483 • Published • 10
Hi @sseymens
Thank you for your comments.
I can help answer your question about the MoE on-policy part.
old_log_prob = log_prob.detach() does not fully solve the on-policy issue: the log-prob is computed with the current policy, but the sampling distribution can still differ because of expert selection. That said, old_log_prob = log_prob.detach() will alleviate the issue if this is the root cause. This is just for hypothesis testing.
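To make the point concrete, here is a toy sketch in plain Python (no MoE model involved). The probabilities are made-up numbers, and the names p_sampler and p_trainer are hypothetical: they stand for what the rollout engine sampled from and what the training forward pass assigns to the same tokens when expert selection differs. It is only meant to show why the detached ratio is always 1 even though a gap to the sampling distribution can remain.

```python
import math

# Hypothetical per-token probabilities for the same sampled tokens.
# p_sampler: distribution the rollout engine actually sampled from
#            (with its own expert selection).
# p_trainer: distribution the training forward pass assigns
#            (possibly different expert selection).
p_sampler = [0.50, 0.30, 0.20]
p_trainer = [0.40, 0.35, 0.25]

def ratios(logp_new, logp_old):
    """Per-token importance ratio exp(logp_new - logp_old)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

logp_trainer = [math.log(p) for p in p_trainer]
logp_sampler = [math.log(p) for p in p_sampler]

# Analogue of old_log_prob = log_prob.detach(): old and new values are
# identical, so the ratio is exactly 1 and clipping never triggers.
ratio_detached = ratios(logp_trainer, logp_trainer)

# Ratio against the distribution the tokens were really sampled from.
ratio_true = ratios(logp_trainer, logp_sampler)

print("detached:", ratio_detached)
print("true    :", [round(r, 4) for r in ratio_true])
```

In this toy case the detached ratio hides the trainer/sampler gap rather than correcting it, which is why it is useful as a diagnostic: if training stabilizes with the detached version, the mismatch is a plausible root cause.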