Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling
Quick Answer
This paper introduces GRACE, a framework for optimizing verification granularity in test-time scaling for large language models. It shows that fine-grained verification excels under high compute budgets or on difficult problems, while coarse-grained verification is better for low budgets and easier tasks.
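The qualitative rule can be written as a tiny decision function. This is only a sketch. The paper derives the boundary analytically as a function of problem difficulty, verifier accuracy, and compute budget, and that formula is not reproduced here. So `choose_granularity`, the [0, 1] difficulty scale, and both threshold values are hypothetical placeholders, and verifier accuracy is ignored.

```python
# Hypothetical thresholds for illustration only; GRACE derives the real
# boundary analytically (including verifier accuracy, which this toy ignores).
HARD_THRESHOLD = 0.6    # assumed difficulty scale: 0.0 = easiest, 1.0 = hardest
BUDGET_THRESHOLD = 32   # assumed budget unit: e.g., number of sampled candidates


def choose_granularity(
    difficulty: float,
    budget: int,
    hard_threshold: float = HARD_THRESHOLD,
    budget_threshold: int = BUDGET_THRESHOLD,
) -> str:
    """Return "fine" (step-level) or "coarse" (outcome-level) verification.

    Encodes only the qualitative phase transition: fine-grained when the
    budget is large OR the problem is hard, coarse-grained otherwise.
    """
    if budget >= budget_threshold or difficulty >= hard_threshold:
        return "fine"
    return "coarse"


print(choose_granularity(0.2, 8))    # easy problem, small budget -> coarse
print(choose_granularity(0.9, 8))    # hard problem, small budget -> fine
print(choose_granularity(0.2, 64))   # easy problem, large budget -> fine
```

The `or` is the important part. Either condition alone is enough to switch to fine-grained verification. Coarse-grained verification is chosen only when the budget is small and the problem is easy.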
Quick Take
Empirical results on MATH-500, GSM8K, and AIME show the adaptive strategy improving accuracy by up to 3.1% over fixed-granularity baselines at matched compute.
Key Points
- GRACE characterizes the optimal verification granularity as a function of problem difficulty, verifier accuracy, and compute budget.
- Fine-grained verification (process reward models) is preferred when the compute budget is large or the problem is hard.
- Coarse-grained verification (outcome reward models) is more effective in the low-budget, easy-problem regime.
- Experiments on MATH-500, GSM8K, and AIME corroborate all four theoretical claims.
- The adaptive strategy outperforms fixed-granularity baselines by up to 3.1% accuracy at matched compute.
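To make the two extremes concrete, the sketch below contrasts Best-of-N selection using an outcome-level scorer with step-level beam search using a process-level scorer. It is a toy built on several assumptions. Candidate "solutions" are lists of step tokens, and `TARGET`, `toy_orm`, `toy_prm`, and the random sampler are hypothetical stand-ins for an LLM and trained reward models, not the paper's experimental setup.

```python
import random
from typing import Callable, List

Solution = List[str]  # a solution is a sequence of reasoning steps

# Hypothetical toy task: the "correct" solution is this exact step sequence.
TARGET: Solution = ["1", "2", "3"]
STEP_CHOICES = ["1", "2", "3"]


def toy_orm(sol: Solution) -> float:
    """Outcome-level (coarse) score: judges only the finished solution."""
    return 1.0 if sol == TARGET else 0.0


def toy_prm(sol: Solution) -> float:
    """Process-level (fine) score: counts correct steps before the first error."""
    score = 0
    for got, want in zip(sol, TARGET):
        if got != want:
            break
        score += 1
    return float(score)


def best_of_n(sample: Callable[[], Solution],
              orm: Callable[[Solution], float], n: int) -> Solution:
    """Coarse: sample n complete solutions, verify each once, keep the best."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=orm)


def step_beam_search(propose: Callable[[Solution], List[str]],
                     prm: Callable[[Solution], float],
                     beam_width: int, max_steps: int) -> Solution:
    """Fine: verify every partial solution and prune the beam after each step."""
    beam: List[Solution] = [[]]
    for _ in range(max_steps):
        expanded = [prefix + [step] for prefix in beam for step in propose(prefix)]
        if not expanded:
            break
        beam = sorted(expanded, key=prm, reverse=True)[:beam_width]
    return beam[0]


rng = random.Random(0)


def sample_solution() -> Solution:
    """Hypothetical sampler: picks each step uniformly at random."""
    return [rng.choice(STEP_CHOICES) for _ in range(len(TARGET))]


coarse = best_of_n(sample_solution, toy_orm, n=8)     # 8 ORM calls
fine = step_beam_search(lambda prefix: STEP_CHOICES, toy_prm,
                        beam_width=2, max_steps=3)    # 3 + 6 + 6 = 15 PRM calls
print(coarse, toy_orm(coarse))
print(fine, toy_orm(fine))
```

The call counts in the comments show why granularity is a compute question. The step-level search spends more verifier calls per problem but prunes wrong prefixes early, whereas Best-of-N only ever scores finished solutions. Which approach wins at a fixed budget is the trade-off GRACE analyzes.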
Article Content
arXiv:2606.19354v1. Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process.
While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes.
In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget.
We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-N, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier.
Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.
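The compute-performance Pareto frontier named in the abstract is the set of strategies that no other strategy beats on both axes at once. In other words, no other strategy reaches at least the same accuracy using no more compute. Below is a minimal sketch that extracts this frontier from measured (compute, accuracy) pairs. The `Point` type and the sample values are hypothetical and arbitrary. They are not results from the paper.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    name: str
    compute: float   # assumed unit: e.g., generated tokens or verifier calls
    accuracy: float  # fraction of problems solved


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points that no other point matches or beats on accuracy at no more compute."""
    frontier: List[Point] = []
    best_acc = float("-inf")
    # Sweep by increasing compute; on equal compute, visit higher accuracy first.
    for p in sorted(points, key=lambda p: (p.compute, -p.accuracy)):
        if p.accuracy > best_acc:  # strictly better than anything cheaper
            frontier.append(p)
            best_acc = p.accuracy
    return frontier


# Arbitrary illustrative values, not measurements.
points = [
    Point("A", 4, 0.50),
    Point("B", 6, 0.48),
    Point("C", 16, 0.58),
    Point("D", 20, 0.63),
    Point("E", 20, 0.60),
]
print([p.name for p in pareto_frontier(points)])  # ['A', 'C', 'D']
```

B is dropped because A reaches higher accuracy with less compute. E is dropped because D reaches higher accuracy at the same compute. The sort dominates the cost, so the sweep runs in O(n log n).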