Hejian Sang

pb09204048

AI & ML interests

None yet

Recent Activity

upvoted a paper 16 days ago

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Paper • 2605.12483 • Published 17 days ago • 10

authored a paper about 1 month ago

TIP: Token Importance in On-Policy Distillation

Paper • 2604.14084 • Published Apr 15 • 15

upvoted a paper about 1 month ago

TIP: Token Importance in On-Policy Distillation

Paper • 2604.14084 • Published Apr 15 • 15

liked a dataset about 2 months ago

ianncity/KIMI-K2.5-1000000x

Viewer • Updated Apr 7 • 733k • 3.61k • 262

upvoted a paper 3 months ago

On-Policy Self-Distillation for Reasoning Compression

Paper • 2603.05433 • Published Mar 5 • 9

submitted a paper to Daily Papers 3 months ago

On-Policy Self-Distillation for Reasoning Compression

Paper • 2603.05433 • Published Mar 5 • 9

authored a paper 3 months ago

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Paper • 2602.21420 • Published Feb 24 • 6

upvoted a paper 3 months ago

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Paper • 2602.21420 • Published Feb 24 • 6

submitted a paper to Daily Papers 3 months ago

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Paper • 2602.21420 • Published Feb 24 • 6

commented on 2 papers 3 months ago

Reinforcement Learning via Self-Distillation

Paper • 2601.20802 • Published Jan 28 • 47

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Paper • 2601.18734 • Published Jan 26 • 7

commented on Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective 4 months ago

Hi @sseymens,
Thank you for your comments.
I can help answer your question about the MoE on-policy part.

Yes, forcing old_log_prob = log_prob.detach() does not solve the on-policy issue: the probability is computed with the current policy, but the sampling distribution can still differ because of expert selection.
When we explored the agentic issues in gpt-oss training, we did not identify the root cause at first. One hypothesis was inference-training inconsistency, but applying importance sampling did not help. So we tested whether forcing old_log_prob = log_prob.detach() would alleviate the issue if that hypothesis were the root cause. This was purely for hypothesis testing.
At that time, verl did not yet support expert router replay (https://arxiv.org/pdf/2510.11370v1), so we could not test that idea. We have now tested the replay, but it is not the root cause either. The root cause is the attention sink.
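
To make the first point concrete, here is a minimal sketch of what forcing old_log_prob = log_prob.detach() does at the value level. It is only an illustration: the two probabilities are made-up numbers, and `importance_ratio` is a hypothetical helper, not a verl function. It shows that the forced ratio is pinned to 1 no matter which distribution actually sampled the token, so any gap between the sampler and the trainer stays uncorrected.

```python
import math


def importance_ratio(log_prob: float, old_log_prob: float) -> float:
    """PPO-style ratio pi_current(a) / pi_old(a), computed from log-probabilities."""
    return math.exp(log_prob - old_log_prob)


# Hypothetical probabilities of one sampled token (illustrative numbers only).
p_train = 0.30    # probability recomputed by the training engine (current policy)
p_rollout = 0.20  # probability under the distribution that actually sampled the token

log_prob = math.log(p_train)

# Forcing old_log_prob to equal the current log_prob (the value-level effect of
# old_log_prob = log_prob.detach()) makes the ratio exactly 1.
forced_ratio = importance_ratio(log_prob, log_prob)

# A ratio that accounts for the sampler uses the rollout probability instead.
ratio_vs_sampler = importance_ratio(log_prob, math.log(p_rollout))

print(f"forced ratio: {forced_ratio:.3f}")
print(f"ratio against the sampling distribution: {ratio_vs_sampler:.3f}")
```

If the two printed values differ, the data is off-policy with respect to the sampler even though the forced ratio says otherwise, which is why the trick is useful for testing a hypothesis but not as a fix.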

published an article 4 months ago

Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective

Article • LinkedIn • Jan 27 • 76

authored a paper 4 months ago

Debunk the Myth of SFT Generalization

Paper • 2510.00237 • Published Sep 30, 2025 • 2

upvoted 2 papers 8 months ago

Debunk the Myth of SFT Generalization

Paper • 2510.00237 • Published Sep 30, 2025 • 2

Planner-R1: Reward Shaping Enables Efficient Agentic RL with Smaller LLMs

Paper • 2509.25779 • Published Sep 30, 2025 • 19

liked 4 datasets 8 months ago

anisha2102/RaR-Science-20k-o3-mini

Viewer • Updated Oct 5, 2025 • 22.9k • 53 • 4

anisha2102/RaR-Medicine-20k-o3-mini

Viewer • Updated Oct 5, 2025 • 22.4k • 97 • 6

LLM360/guru-RL-92k

Viewer • Updated Aug 20, 2025 • 91.9k • 13.2k • 46

a-m-team/AM-Thinking-v1-Distilled

Preview • Updated Jun 12, 2025 • 1.15k • 58