When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models

arXiv cs.AI·Zhe Dong (University of Maine at Presque Isle), Fang Qin (Stanford University), Manish Shah (Independent Researcher)


·~2 min·7/1/2026·en·0

Quick Answer

LearnStop, a checkpoint stopper for reasoning models, delivers task-dependent benefits from early exits.

Quick Take

On free-form math tasks such as GSM8K with Qwen3-32B, LearnStop reaches a post-hoc peak adapt gain of +0.157 and outperforms scalar exits. In multiple-choice and very hard settings, scalar confidence, entropy, or stability rules remain competitive or stronger.

Key Points

LearnStop improves early exits in reasoning models, especially in free-form math tasks.
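Reaches a post-hoc peak adapt gain of +0.157 on GSM8K with Qwen3-32B, with a paired gain of +0.028 over the strongest scalar baseline.
Scalar confidence, entropy, and stability rules are competitive or stronger in multiple-choice and very hard settings.
Learned stopping helps when many questions become correct before the full budget is used but no single scalar signal reliably indicates when to stop.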
Validation-selected operating points and robustness checks support the findings.


Article Content

From source RSS / original summary

arXiv:2606.30852v1. Announce type: new. Abstract: Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds. We study this question with LearnStop, a hidden-state-free checkpoint stopper for reasoning language models.

At fixed budget checkpoints, LearnStop probes a short answer from the current reasoning prefix and predicts prefix correctness from online features such as answer confidence, entropy, prefix vote share, answer stability, and backtracking-marker density. Across 18 task-model settings spanning GSM8K, MATH-500, , AIME-90, , Qwen3, and DeepSeek-R1 distillations, the answer is task-dependent.
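
As a rough illustration of the checkpoint-stopping idea described above, the following Python sketch computes the five named online features from probed answers and combines them with a logistic score. It is not the paper's implementation: the feature definitions, the marker list, the Checkpoint structure, and the weights, bias, and threshold passed to should_stop are all assumptions made for this example.

```python
import math
from collections import Counter
from dataclasses import dataclass

# Hypothetical marker list; the paper's actual backtracking markers are not given here.
BACKTRACK_MARKERS = ("wait", "hmm", "actually", "let me reconsider")


@dataclass
class Checkpoint:
    prefix_text: str           # reasoning prefix up to this budget checkpoint
    answer: str                # short answer probed from the prefix
    answer_probs: list[float]  # probe's probabilities over candidate answers


def features(history: list[Checkpoint]) -> dict[str, float]:
    """Compute online features for the latest checkpoint (assumed definitions)."""
    current = history[-1]
    probs = [p for p in current.answer_probs if p > 0]
    answers = [c.answer for c in history]

    # Length of the trailing run of checkpoints that agree with the current answer.
    run = 0
    for a in reversed(answers):
        if a != current.answer:
            break
        run += 1

    text = current.prefix_text.lower()
    n_words = max(len(text.split()), 1)
    n_markers = sum(text.count(m) for m in BACKTRACK_MARKERS)

    return {
        "confidence": max(probs),
        "entropy": -sum(p * math.log(p) for p in probs),
        "vote_share": Counter(answers)[current.answer] / len(answers),
        "stability": run / len(answers),
        "backtrack_density": n_markers / n_words,
    }


def correctness_probability(feats: dict[str, float],
                            weights: dict[str, float],
                            bias: float) -> float:
    """Logistic score standing in for a learned prefix-correctness predictor."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def should_stop(history: list[Checkpoint],
                weights: dict[str, float],
                bias: float,
                threshold: float) -> bool:
    """Exit early when the predicted prefix correctness reaches the threshold."""
    return correctness_probability(features(history), weights, bias) >= threshold
```

In this sketch a scalar exit is the special case where only one weight (for example, the one on confidence) is nonzero, which is one way to see why a multi-feature stopper can only add value when no single feature is already a reliable stopping signal.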

On free-form math, learned multi-feature stopping improves the fixed-budget frontier and often beats scalar exits: on GSM8K with Qwen3-32B, the empirical frontier reaches a post-hoc peak adapt gain of +0.157, validation-selected operating points preserve positive gains, and the paired gain over the strongest scalar baseline is +0.028. On multiple-choice and very hard settings, scalar confidence, entropy, or stability rules are competitive or stronger.

We therefore frame learned stopping not as a universal replacement for scalar exits, but as a tool whose value depends on trajectory structure. We further provide validation-selected operating points, paired bootstrap tests, finite-grid lost-correct risk calibration, cost accounting under KV-fork, prefix-cache, and black-box regimes, H100 serving profiles, checkpoint-schedule sweeps, transfer analyses, and robustness checks.
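
The paired bootstrap tests mentioned above can be illustrated with a generic percentile bootstrap over per-question differences. This is a minimal sketch under stated assumptions, not the paper's protocol: the function name paired_bootstrap, the 0/1 correctness inputs, the number of resamples, and the 95% interval are all choices made for this example.

```python
import random


def paired_bootstrap(learned: list[float],
                     baseline: list[float],
                     n_resamples: int = 2000,
                     alpha: float = 0.05,
                     seed: int = 0) -> tuple[float, float, float]:
    """Return (observed mean gain, CI lower bound, CI upper bound).

    learned[i] and baseline[i] are scores for the same question i (for example
    1.0 if correct, 0.0 otherwise), so resampling questions keeps the pairing.
    """
    if len(learned) != len(baseline) or not learned:
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)
    n = len(learned)
    diffs = [x - y for x, y in zip(learned, baseline)]
    observed = sum(diffs) / n

    # Resample question indices with replacement and record the mean gain.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    # Percentile interval over the resampled mean gains.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

Under this sketch, a gain such as the reported +0.028 over the strongest scalar baseline would be read as reliable only if the lower bound of the interval stays above zero.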

The main practical finding is that learned stopping is useful when many questions become correct before full budget but do not exhibit a single reliable scalar stopping signal; its benefits largely disappear when confidence or answer convergence already solves the stopping problem.

