Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

arXiv cs.CL·Yuxuan Jiang, Francis Ferraro

6/2/2026

Quick Take

Trajectory-aware On-Policy Distillation (TOPD) uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. It reaches 52.2% average accuracy, compared with 47.8% for standard OPD and 48.2% for OPD with non-divergent high-loss tokens suppressed, and it also improves results on AIME24 and AIME25.

Key Points

TOPD reaches 52.2% average accuracy using near-future trajectory guidance, up from 47.8% for standard OPD.
Suppressing non-divergent high-loss tokens alone improves standard OPD from 47.8% to 48.2%.
AIME24 accuracy rises from 60.0% to 63.3% with TOPD.
AIME25 accuracy rises from 46.7% to 53.3% with TOPD.
About 30% of high-loss tokens fall into the low-divergence regime, suggesting that many are surface-form mismatches rather than real reasoning forks.

Article Content

From source RSS / original summary

arXiv:2606.00305v1. Abstract: On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction.
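To make the token-level signal concrete, here is a minimal sketch of per-position reverse KL, KL(student || teacher), followed by a simple threshold that flags high-loss positions. The function names, the toy distributions, and the threshold of 0.5 are assumptions chosen for illustration; the abstract does not specify how the paper computes or thresholds the loss.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position (illustrative helper)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def token_losses(student_dists, teacher_dists):
    """Reverse KL at every position of a student-sampled trajectory."""
    return [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]

def high_loss_positions(losses, threshold):
    """Indices whose loss exceeds a threshold (the threshold is an assumption)."""
    return [i for i, loss in enumerate(losses) if loss > threshold]

# Toy example: 3 positions over a 3-token vocabulary.
student = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.5, 0.25, 0.25]]
teacher = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1], [0.5, 0.25, 0.25]]

losses = token_losses(student, teacher)
flagged = high_loss_positions(losses, threshold=0.5)
print(flagged)  # only position 1, where student and teacher disagree
```

Each position is scored in isolation here, which is the "token-learned" property the abstract describes: nothing in the loss looks at what happens after the flagged token.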

We show that this "trajectory-sampled but token-learned" mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift.

We propose Trajectory-aware OPD (TOPD), which uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. Experiments show that suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy, while TOPD further improves performance to 52.2%, with gains on AIME24 from 60.0% to 63.3% and AIME25 from 46.7% to 53.3%.
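The abstract does not give TOPD's scoring rule or weighting scheme, so the following is only a rough sketch of the general idea under stated assumptions. It assumes a high-loss token counts as a real divergence when the mean loss over the next few positions is also elevated, and it spreads guidance over that short window with a geometric decay. The names `near_future_score` and `guidance_weights`, both thresholds, the horizon, and the decay factor are all hypothetical.

```python
def near_future_score(losses, t, horizon):
    """Mean loss over the `horizon` positions after t (hypothetical drift score)."""
    window = losses[t + 1 : t + 1 + horizon]
    return sum(window) / len(window) if window else 0.0

def guidance_weights(losses, loss_threshold, drift_threshold, horizon, decay=0.5):
    """Per-position guidance weights; every parameter here is an assumption."""
    weights = [0.0] * len(losses)
    for t, loss in enumerate(losses):
        if loss <= loss_threshold:
            continue  # not a high-loss token
        if near_future_score(losses, t, horizon) <= drift_threshold:
            continue  # high loss but no near-future drift: suppressed
        # Spread guidance over the divergent token and the next `horizon` tokens.
        for offset in range(horizon + 1):
            if t + offset < len(losses):
                weights[t + offset] += decay ** offset
    return weights

# Position 1 is high-loss but the trajectory recovers immediately;
# position 5 is high-loss and followed by continued drift.
losses = [0.1, 2.0, 0.1, 0.1, 0.1, 2.0, 1.0, 0.8, 0.1]
weights = guidance_weights(losses, loss_threshold=1.5, drift_threshold=0.5, horizon=2)
print(weights)
```

In this toy trajectory, position 1 receives no weight because the loss drops straight back down, while position 5 and the two tokens after it receive decaying weights. That mirrors the two ideas in the abstract (filtering non-divergent high-loss tokens and distributing guidance across future tokens) without claiming to reproduce the paper's actual method.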

