HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents
Abstract
HINT-SD is a targeted self-distillation framework that selects failure-relevant actions from full trajectories to improve long-horizon LLM agent training efficiency and effectiveness.
Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80% while achieving 2.26× lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
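The idea of distilling only on targeted action spans can be illustrated with a small masked loss. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `token_kl`, `targeted_distill_loss`, `turn_spans`, and `targeted_turns` are hypothetical, the KL direction (teacher to student) is assumed, and the set of failure-relevant turns is taken as given rather than produced by a hindsight analysis step.

```python
import math
from typing import List, Sequence, Set, Tuple


def token_kl(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(teacher || student) at one token position.

    Assumes both inputs are probability vectors over the same vocabulary
    and that every student probability is strictly positive.
    """
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)


def targeted_distill_loss(
    teacher_probs: List[Sequence[float]],
    student_probs: List[Sequence[float]],
    turn_spans: List[Tuple[int, int]],
    targeted_turns: Set[int],
) -> float:
    """Mean per-token KL, restricted to tokens inside the targeted turns.

    turn_spans[i] = (start, end) is the half-open token range of turn i.
    Tokens in turns outside targeted_turns contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for turn_idx, (start, end) in enumerate(turn_spans):
        if turn_idx not in targeted_turns:
            continue  # skip turns not selected as failure-relevant
        for pos in range(start, end):
            total += token_kl(teacher_probs[pos], student_probs[pos])
            count += 1
    return total / count if count else 0.0


# Toy trajectory: two turns of two tokens each, vocabulary of size 2.
# The feedback-conditioned teacher disagrees with the student only in turn 1.
student = [[0.5, 0.5]] * 4
teacher = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
spans = [(0, 2), (2, 4)]

targeted = targeted_distill_loss(teacher, student, spans, {1})
dense = targeted_distill_loss(teacher, student, spans, {0, 1})
```

In this toy case the dense loss averages over all four tokens, so the signal from the two tokens where the teacher disagrees is diluted by the two tokens where it agrees, while the targeted loss averages over the selected turn only. In a real training loop the teacher distributions would also only need to be computed for the targeted turns, which is where a reduction in time per step would plausibly come from.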
Community
We propose HINT-SD, a targeted hindsight self-distillation framework for long-horizon agents that improves learning by identifying and correcting only the actions responsible for task failure. Instead of distilling entire trajectories, HINT-SD performs hindsight analysis to isolate failure-critical decisions and conducts self-distillation on each targeted turn, with the teacher conditioned on generated hindsight feedback.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents (2026)
- Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents (2026)
- Co-Evolution of Policy and Internal Reward for Language Agents (2026)
- Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision (2026)
- Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing (2026)
- Self-Supervised On-Policy Distillation for Reasoning Language Models (2026)
- GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation (2026)
Get this paper in your agent:
hf papers read 2605.17873
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash