Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance
Quick Take
Trajectory-aware On-Policy Distillation (TOPD) improves reasoning in large language models by using near-future trajectory information to identify real divergent states and spread guidance across multiple future tokens. It reaches 52.2% average accuracy, compared with 47.8% for standard OPD and 48.2% for OPD with non-divergent high-loss tokens suppressed, with reported gains on AIME24 and AIME25.
Key Points
- TOPD reaches 52.2% average accuracy using near-future trajectory guidance, up from 48.2% for OPD with non-divergent high-loss tokens suppressed.
- Suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy.
- AIME24 scores increased from 60.0% to 63.3% with TOPD.
- AIME25 scores improved from 46.7% to 53.3% using the new method.
- About 30% of high-loss tokens fall into the low-divergence regime, suggesting many are surface-form mismatches rather than real reasoning forks.
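To put the figures above on a common scale, the short sketch below converts each before/after pair into an absolute gain in percentage points and a relative gain. The pairing of baselines follows the summary as written; which OPD variant serves as the AIME24 and AIME25 baseline is an assumption here, since the abstract does not say. The helper name `gains` is illustrative only.

```python
# Before/after accuracies (in percent) as quoted in the summary.
# Assumption: each pair compares a baseline OPD run against TOPD.
results = {
    "average": (48.2, 52.2),
    "AIME24": (60.0, 63.3),
    "AIME25": (46.7, 53.3),
}


def gains(before, after):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(after - before, 1)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in results.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

The absolute gains come out to 4.0, 3.3, and 6.6 percentage points respectively, so the largest improvement is on AIME25.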
Article Content
From source RSS / original summary
arXiv:2606.00305v1 Announce Type: new
Abstract: On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction.
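As a rough illustration of the token-level signal described above, the sketch below computes a reverse KL divergence, KL(student || teacher), at each position of a sampled sequence. It is a minimal pure-Python sketch under stated assumptions, not the paper's implementation: the function names, the toy three-word vocabulary, and the use of raw logits as input are all illustrative.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position.

    The expectation is taken under the student distribution, which is
    what makes this the "reverse" direction in distillation.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def per_token_reverse_kl(student_logits, teacher_logits):
    """Per-position reverse KL for one sequence (lists of logit vectors)."""
    return [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]


# Toy example: position 0 agrees exactly, position 1 disagrees strongly.
student = [[1.0, 2.0, 3.0], [3.0, 1.0, 0.0]]
teacher = [[1.0, 2.0, 3.0], [0.0, 1.0, 3.0]]
losses = per_token_reverse_kl(student, teacher)
print(losses)
```

In this toy case the first position yields a loss of zero and the second a large positive value, which is the kind of "high-loss token" the abstract refers to.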
We show that this "trajectory-sampled but token-learned" mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift.
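One way to picture the distinction drawn above is to look at what happens just after a high-loss token. The sketch below labels a high-loss token as a surface-form mismatch when the average loss over the next few positions stays low, and as a divergent token when it stays high. This is a hypothetical heuristic for illustration: the abstract does not define the low-divergence regime, and the function name, the thresholds `loss_threshold` and `drift_threshold`, and the window size `horizon` are all assumptions.

```python
def classify_high_loss_tokens(token_losses, loss_threshold=1.0,
                              horizon=3, drift_threshold=0.5):
    """Label each high-loss position by its near-future behaviour.

    Hypothetical rule (not taken from the paper): a position counts as
    "divergent" if the mean loss over the next `horizon` positions is at
    least `drift_threshold`, otherwise as a "surface_mismatch".
    """
    labels = {}
    for t, loss in enumerate(token_losses):
        if loss < loss_threshold:
            continue  # only high-loss tokens are of interest
        future = token_losses[t + 1 : t + 1 + horizon]
        # With no future tokens left, there is no evidence of drift.
        drift = sum(future) / len(future) if future else 0.0
        labels[t] = "divergent" if drift >= drift_threshold else "surface_mismatch"
    return labels


# Position 1 spikes once and recovers; positions 5 and 6 start a sustained drift.
losses = [0.1, 2.0, 0.1, 0.05, 0.1, 1.8, 1.2, 0.9, 0.7]
print(classify_high_loss_tokens(losses))
# {1: 'surface_mismatch', 5: 'divergent', 6: 'divergent'}
```

The isolated spike at position 1 is treated as a wording difference, while the run starting at position 5 looks like the short-horizon drift the abstract describes.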
We propose Trajectory-aware OPD (TOPD), which uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. Experiments show that suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy, while TOPD further improves performance to 52.2%, with gains on AIME24 from 60.0% to 63.3% and AIME25 from 46.7% to 53.3%.
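The idea of distributing guidance across multiple future tokens can be sketched as a per-token weighting of the loss. The example below is an assumption-laden illustration, not TOPD itself: the abstract does not give the weighting scheme, so the geometric `decay`, the window size `horizon`, and the function names are hypothetical. Tokens judged non-divergent are given zero weight, and each divergent position adds extra weight to itself and the positions that follow it.

```python
def guidance_weights(num_tokens, divergent_positions, suppressed_positions,
                     horizon=3, decay=0.5):
    """Build per-token loss weights (hypothetical scheme, for illustration).

    Every token starts at weight 1.0. Each divergent position t adds
    decay**k to positions t, t+1, ..., t+horizon. Suppressed positions
    are zeroed last so they stay at 0 even inside a boosted window.
    """
    weights = [1.0] * num_tokens
    for t in divergent_positions:
        for k in range(horizon + 1):
            if t + k < num_tokens:
                weights[t + k] += decay ** k
    for t in suppressed_positions:
        weights[t] = 0.0
    return weights


def weighted_loss(token_losses, weights):
    """Weight-normalized average of per-token losses."""
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, token_losses)) / total_weight


# Position 0 is a suppressed surface mismatch; position 2 is a divergent state.
weights = guidance_weights(6, divergent_positions=[2], suppressed_positions=[0],
                           horizon=2, decay=0.5)
print(weights)  # [0.0, 1.0, 2.0, 1.5, 1.25, 1.0]
print(weighted_loss([2.0, 0.1, 1.5, 1.0, 0.8, 0.1], weights))
```

Compared with correcting only the single high-loss token, this kind of weighting keeps some supervision on the tokens immediately after the fork, which is where the abstract says reasoning failures tend to unfold.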