Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking

arXiv cs.CL·Chia-Hsuan Lee, Sihui Dai, Mingyang Zhou, Isha Slavin, Shi-Xiong Zhang, Sambit Sahu, William Campbell

3h ago

·~1 min·7/2/2026·en·0

Quick Answer

DASH (Drift Aware advantage SHaping) assigns segment-level credit within reasoning traces, which reduces overthinking behaviors and reaches 50.8% accuracy on AIME25, compared to 45.4% for GRPO.

Quick Take

DASH uses intermediate answer commitments within a reasoning trace to identify where self-reflection is productive, avoiding costly step-level annotations. On competition-level math benchmarks, it achieves the highest accuracy where overthinking is prevalent.

Key Points

DASH assigns credit based on reasoning segment contributions towards correctness.
The method reduces unproductive self-reflection in language models.
Achieved 50.8% accuracy on AIME25, outperforming GRPO (45.4%) by 5.4 percentage points.
Intermediate answer commitments serve as a low-cost proxy for reflection evaluation.
DASH achieves more productive self-correction than baselines on reasoning tasks.

Article Content

From source RSS / original summary

arXiv:2607.00482v1 · Announce Type: new

Abstract: Reasoning language models frequently overthink: generating extended chains of behaviors such as hedging, approach abandonment, and self-contradiction that consume tokens without improving answers. We show that these behaviors are not merely a consequence of length; even when controlling for response length, incorrect traces exhibit higher rates of unproductive self-reflection than correct ones.

Addressing this requires identifying where self-reflection helps vs hurts, but obtaining these step-level annotations is costly. We observe that intermediate answer commitments within reasoning traces can provide a cheap proxy: by comparing each final answer candidate in the trace to the ground truth, we can determine whether subsequent reflection is productive without any additional supervision.
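
The proxy described above can be illustrated with a small sketch. This is not the authors' code: the function name `label_reflections`, the four label names, and the exact-string answer match are all assumptions made for illustration. It only shows how comparing consecutive answer commitments to the ground truth could label the reflection between them without extra supervision.

```python
from typing import List


def label_reflections(candidates: List[str], ground_truth: str) -> List[str]:
    """Label the reflection between each pair of consecutive answer commitments.

    `candidates` are the intermediate answers in the order they appear in the
    trace (hypothetical input; extracting them from text is not shown here).
    """
    # Exact string match is an assumed stand-in for a real answer checker.
    correct = [c.strip() == ground_truth.strip() for c in candidates]

    labels = []
    for before, after in zip(correct, correct[1:]):
        if not before and after:
            labels.append("productive")   # reflection fixed a wrong answer
        elif before and not after:
            labels.append("harmful")      # reflection abandoned a right answer
        elif before and after:
            labels.append("redundant")    # already right, stayed right
        else:
            labels.append("unresolved")   # wrong before and after
    return labels


print(label_reflections(["12", "15", "15", "9"], "15"))
# → ['productive', 'redundant', 'harmful']
```

In this toy trace the model reaches the right answer at its second commitment and then reflects its way back out of it, which is the kind of pattern the abstract calls unproductive self-reflection.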

Building on this insight, we propose DASH (Drift Aware advantage SHaping), which assigns segment-level credit based on whether each reasoning segment leads toward or away from correctness. On competition-level math benchmarks, DASH achieves the highest accuracy where overthinking is prevalent (AIME25: 50.8% vs. 45.4% for GRPO) while reducing overthinking behaviors and achieving more productive self-correction than baselines.
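
The abstract does not give the shaping formula, so the following is only a rough sketch of what segment-level credit could look like. It assumes a GRPO-style group-normalized trajectory advantage plus an additive per-segment drift term; the helper names, the additive form, and the weight `beta` are hypothetical and should not be read as the paper's actual objective.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_drift(correct_flags: List[bool]) -> List[int]:
    """+1 if a segment moves from wrong to right, -1 if right to wrong, else 0.

    correct_flags[i] says whether the answer committed at the end of segment i
    matches the ground truth. Before the first commitment the trace is treated
    as not yet correct (an assumption of this sketch).
    """
    drifts = []
    prev = False
    for flag in correct_flags:
        drifts.append(int(flag) - int(prev))
        prev = flag
    return drifts


def shaped_segment_advantages(
    traj_adv: float, correct_flags: List[bool], beta: float = 0.5
) -> List[float]:
    """Assumed additive shaping: trajectory advantage plus beta * drift."""
    return [traj_adv + beta * d for d in segment_drift(correct_flags)]


# Segments ending in answers: wrong, right, right, wrong.
print(shaped_segment_advantages(0.0, [False, True, True, False]))
# → [0.0, 0.5, 0.0, -0.5]
```

Under these assumptions, every token in a segment would share that segment's shaped advantage, so a segment that drifts away from a correct answer is penalized even when the rest of the trace is rewarded.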

