CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

arXiv cs.AI·Fangzhou Lin, Shuo Xing, Peiran Li, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhengzhong Tu

5/18/2026

Quick Answer

This paper introduces CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that makes pairwise self-verification in parallel reasoning cheaper by roughly halving the per-candidate marginal verifier-token cost relative to uniform full-evidence schedules.

Quick Take

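CAPS (Cascaded Adaptive Pairwise Selection) is an inference-only framework that lowers the cost of pairwise self-verification in parallel reasoning for large language models. Its closed-form verifier-token cost has a per-candidate marginal cost roughly half that of uniform full-evidence schedules. Across 20 model-benchmark suites (four models, five benchmarks), it outperforms the leading pairwise verifier on 14 while using 25.4% of that verifier's token budget on code tasks. It allocates verifier compute along two axes: an evidence axis, which adapts how much of each candidate the judge sees, and a distribution axis, which adapts how comparisons are spread across the candidate pool. The evaluated self-verifying models are Qwen3-14B, GPT-OSS-20B, and Qwen3-4B-Instruct/Thinking.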

Key Points

CAPS roughly halves the per-candidate marginal verifier-token cost relative to uniform full-evidence schedules.
It uses a four-stage cascade with an optional rescue subroutine.
It outperforms pointwise self-verification on all 20 model-benchmark suites tested (four models, five benchmarks).
It is evaluated on four self-verifying models: Qwen3-14B, GPT-OSS-20B, and Qwen3-4B-Instruct/Thinking.
It offers an interpretable diagnostic, based on the verifier's accuracy at partial versus full evidence, as a pre-deployment check for cascade suitability.
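
The evidence-axis idea in the points above can be illustrated with a minimal Python sketch. It is not the paper's four-stage cascade or its rescue subroutine, whose details this summary does not give. It only shows the general pattern of judging a pair on partial evidence first and reading both full solutions only when the judge is unsure. The `Judge` signature, the confidence `threshold`, the whitespace token count, the knockout bracket, and `toy_judge` are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# takes two candidate texts, returns (preferred index 0 or 1, confidence in [0, 1]).
Judge = Callable[[str, str], Tuple[int, float]]


def count_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words stand in for model tokens."""
    return len(text.split())


def truncate(text: str, fraction: float) -> str:
    """Keep the leading `fraction` of a candidate (at least one token)."""
    tokens = text.split()
    keep = max(1, int(len(tokens) * fraction))
    return " ".join(tokens[:keep])


def compare_adaptive(a: str, b: str, judge: Judge,
                     partial_fraction: float = 0.5,
                     threshold: float = 0.8) -> Tuple[int, int]:
    """Judge on partial evidence; escalate to full evidence only if unsure.

    Returns (preferred index, verifier tokens read for this comparison).
    """
    pa, pb = truncate(a, partial_fraction), truncate(b, partial_fraction)
    cost = count_tokens(pa) + count_tokens(pb)
    winner, confidence = judge(pa, pb)
    if confidence < threshold:
        # Low confidence: pay for a second judgment over both full solutions.
        winner, _ = judge(a, b)
        cost += count_tokens(a) + count_tokens(b)
    return winner, cost


def knockout_select(candidates: List[str], judge: Judge,
                    **kwargs: float) -> Tuple[int, int]:
    """Single-elimination selection over the pool.

    Returns (index of the selected candidate, total verifier tokens read).
    The bracket is a simple stand-in for a comparison schedule.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    alive = list(range(len(candidates)))
    total = 0
    while len(alive) > 1:
        next_round = []
        for i in range(0, len(alive) - 1, 2):
            x, y = alive[i], alive[i + 1]
            w, cost = compare_adaptive(candidates[x], candidates[y], judge, **kwargs)
            total += cost
            next_round.append(x if w == 0 else y)
        if len(alive) % 2 == 1:
            next_round.append(alive[-1])  # odd candidate out gets a bye
        alive = next_round
    return alive[0], total


def toy_judge(a: str, b: str) -> Tuple[int, float]:
    """Stand-in judge for demos: prefers the longer text, and is more
    confident when the length difference is larger. A real verifier
    would be an LLM call."""
    la, lb = count_tokens(a), count_tokens(b)
    winner = 0 if la >= lb else 1
    confidence = abs(la - lb) / max(la, lb, 1)
    return winner, confidence
```

Calling `knockout_select(candidates, toy_judge, partial_fraction=0.5, threshold=0.5)` returns the selected index together with the tokens the judge read, so the same pool can be rerun with a threshold above 1 (always escalate) to compare against a schedule that reads every pair in full.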

[Submitted on 15 May 2026]

Abstract: Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier's accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.
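
The diagnostic mentioned at the end of the abstract compares the verifier's accuracy at partial versus full evidence. The Python sketch below shows one way such a check might be computed on a small labeled set of candidate pairs. The abstract does not state the exact procedure or any decision rule, so the `Verifier` signature, the prefix-based notion of partial evidence, and the reported `gap` statistic are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumed data layout: (candidate_a, candidate_b, index of the better candidate).
LabeledPair = Tuple[str, str, int]
# Hypothetical verifier interface: returns 0 if it prefers the first text, else 1.
Verifier = Callable[[str, str], int]


def prefix(text: str, fraction: float) -> str:
    """Leading `fraction` of a candidate, using whitespace tokens as a proxy."""
    tokens = text.split()
    keep = max(1, int(len(tokens) * fraction))
    return " ".join(tokens[:keep])


def accuracy(pairs: List[LabeledPair], verifier: Verifier, fraction: float) -> float:
    """Share of pairs where the verifier picks the labeled winner when it
    sees only the leading `fraction` of each candidate."""
    if not pairs:
        raise ValueError("need at least one labeled pair")
    correct = sum(
        verifier(prefix(a, fraction), prefix(b, fraction)) == label
        for a, b, label in pairs
    )
    return correct / len(pairs)


def evidence_gap(pairs: List[LabeledPair], verifier: Verifier,
                 partial_fraction: float = 0.5) -> Dict[str, float]:
    """Report accuracy at partial and full evidence and their difference.

    A large positive gap suggests partial-evidence judgments are unreliable
    for this verifier and task, so early cascade stages may mislead it.
    """
    partial = accuracy(pairs, verifier, partial_fraction)
    full = accuracy(pairs, verifier, 1.0)
    return {"partial": partial, "full": full, "gap": full - partial}
```

What gap counts as too large is left open here; that cutoff would have to be chosen per verifier and task from held-out data.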

Comments:	31 pages, 2 figures, 18 tables
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.15513 [cs.AI]
	(or arXiv:2605.15513v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.15513 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Fangzhou Lin
[v1] Fri, 15 May 2026 01:16:12 UTC (220 KB)

— Originally published at arxiv.org
