Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking
Quick Answer
DASH improves reasoning in language models through segment-level credit assignment, reducing overthinking and reaching 50.8% accuracy on AIME25 versus 45.4% for GRPO.
Quick Take
DASH identifies productive self-reflection without costly step-level annotations, using intermediate answer commitments in the reasoning trace as a proxy, and improves accuracy on competition-level math tasks where overthinking is prevalent.
Key Points
- DASH assigns credit to each reasoning segment according to whether it leads toward or away from the correct answer.
- The method reduces unproductive self-reflection in language models.
- DASH achieved 50.8% accuracy on AIME25, outperforming GRPO (45.4%) by 5.4 percentage points.
- Intermediate answer commitments serve as a low-cost proxy for reflection evaluation.
- DASH enhances self-correction capabilities in reasoning tasks.
Article Content
Source: arXiv:2607.00482v1 (original abstract)
Reasoning language models frequently overthink: generating extended chains of behaviors such as hedging, approach abandonment, and self-contradiction that consume tokens without improving answers. We show that these behaviors are not merely a consequence of length; even when controlling for response length, incorrect traces exhibit higher rates of unproductive self-reflection than correct ones.
Addressing this requires identifying where self-reflection helps vs hurts, but obtaining these step-level annotations is costly. We observe that intermediate answer commitments within reasoning traces can provide a cheap proxy: by comparing each final answer candidate in the trace to the ground truth, we can determine whether subsequent reflection is productive without any additional supervision.
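The proxy described above can be illustrated with a minimal sketch. It assumes the trace has already been split into segments, each ending in a committed candidate answer; the function name `label_segments`, the label strings, and the exact-match comparison are hypothetical choices for illustration, not details taken from the paper.

```python
from typing import List


def label_segments(candidates: List[str], ground_truth: str) -> List[str]:
    """Label each segment by how correctness changes across it.

    candidates[i] is the answer committed at the end of segment i.
    Before the first segment, the trace is treated as not yet correct.
    (Assumption: exact string match stands in for answer checking.)
    """
    labels = []
    prev_correct = False
    for cand in candidates:
        now_correct = cand.strip() == ground_truth.strip()
        if now_correct and not prev_correct:
            labels.append("productive")  # moved from wrong (or no answer) to right
        elif prev_correct and not now_correct:
            labels.append("harmful")     # moved from right to wrong
        else:
            labels.append("neutral")     # correctness unchanged
        prev_correct = now_correct
    return labels


if __name__ == "__main__":
    # A trace that finds the answer, re-checks it, then talks itself out of it.
    print(label_segments(["12", "15", "15", "9"], ground_truth="15"))
```

In this toy trace, the reflection after the second segment adds nothing and the final reflection abandons a correct answer, which is the kind of unproductive self-reflection the abstract describes. No labels beyond the ground-truth final answer are needed.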
Building on this insight, we propose DASH (Drift Aware advantage SHaping), which assigns segment-level credit based on whether each reasoning segment leads toward or away from correctness. On competition-level math benchmarks, DASH achieves the highest accuracy where overthinking is prevalent (AIME25: 50.8% vs. 45.4% for GRPO) while reducing overthinking behaviors and achieving more productive self-correction than baselines.
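The abstract does not give the shaping rule, so the following is only a rough sketch of what segment-level credit could look like on top of a GRPO-style group-normalized advantage. The additive form, the `beta` coefficient, and the function names are assumptions made for illustration and should not be read as the DASH formulation.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantage: one scalar per sampled response."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shape_segment_advantages(
    base_adv: float, drifts: List[int], beta: float = 0.5
) -> List[float]:
    """Give each segment its own advantage instead of one value per response.

    drifts[i] is +1 if segment i moves toward the correct answer,
    -1 if it moves away, and 0 if correctness is unchanged.
    beta is a hypothetical shaping strength.
    """
    return [base_adv + beta * d for d in drifts]


if __name__ == "__main__":
    # Two sampled responses: the first is correct, the second is not.
    advs = grpo_advantages([1.0, 0.0])
    # The correct response had a neutral, a productive, and a harmful segment.
    print(shape_segment_advantages(advs[0], drifts=[0, 1, -1]))
```

The point of the sketch is the contrast with plain GRPO, where every token in a response shares the same advantage. Here, tokens in a segment that drifts away from a correct answer receive less credit than tokens in a segment that reaches it, even within the same response.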