Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing
Quick Answer
This paper shows that the Local Branch Routing (LBR) framework improves language-model test-time scaling by basing each token decision on evidence from a small local lookahead tree. On mathematical reasoning benchmarks, it outperforms discrete chain-of-thought, discrete-token RLVR, and soft-token branching baselines.
Quick Take
LBR improves both Pass@1 and Pass@32 over discrete chain-of-thought, vanilla discrete-token RLVR, and RL-compatible soft-token branching baselines. Because its decoding process defines a tractable likelihood, the base model and the router can be trained jointly, end to end, with reinforcement learning.
Key Points
- LBR uses a lightweight router over the hidden states of candidate branches to choose which depth-1 subtree of a small local lookahead tree to commit at each token step.
- Its prune-shift-grow decoding process preserves discrete branch identities, which yields a tractable tree-trajectory likelihood.
- On synthetic hierarchical-planning tasks, post-candidate hidden states provide useful routing evidence.
- The base model and the router can be trained jointly, end to end, with reinforcement learning from verifiable rewards (RLVR), under the same likelihood-ratio principle as discrete-token RLVR.
- On mathematical reasoning benchmarks, LBR improves both Pass@1 and Pass@32 over discrete chain-of-thought, vanilla discrete-token RLVR, and RL-compatible soft-token branching baselines.
Article Content
arXiv:2606.25354v1 (Announce Type: new)
Abstract: Test-time scaling improves language-model reasoning, but existing approaches often face a difficult trade-off: long chain-of-thought sampling remains single-threaded, while sentence- or solution-level search can be computationally expensive and hard to train end-to-end.
We introduce Local Branch Routing (LBR), a token-level test-time scaling framework that expands a small local lookahead tree, forwards all sampled branches through the language model, and uses a lightweight router to select the depth-1 subtree to commit. By routing over the hidden states of candidate local futures, LBR allows each token decision to use evidence beyond the root next-token distribution while avoiding full solution-level search.
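One routing step of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. `toy_lm` is a hypothetical deterministic stand-in for the language model. The branching factor, depth, and vocabulary are arbitrary. The router is assumed to be a single linear layer over mean-pooled hidden states of each depth-1 subtree; the abstract does not specify the router architecture or how hidden states are pooled.

```python
import math
import random

VOCAB = list(range(8))  # hypothetical tiny vocabulary
HIDDEN_DIM = 4          # hypothetical hidden-state width


def toy_lm(prefix):
    """Hypothetical LM stand-in: deterministic next-token probs and hidden state per prefix."""
    rng = random.Random(hash(tuple(prefix)))
    logits = [rng.gauss(0.0, 1.0) for _ in VOCAB]
    z = sum(math.exp(x) for x in logits)
    probs = [math.exp(x) / z for x in logits]
    hidden = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
    return probs, hidden


def expand(prefix, depth, branching, rng):
    """Sample a local lookahead tree; each node is (token, hidden, children)."""
    if depth == 0:
        return []
    probs, _ = toy_lm(prefix)
    nodes = []
    for _ in range(branching):
        tok = rng.choices(VOCAB, weights=probs)[0]
        # Forward the sampled branch so its hidden state reflects the candidate token.
        _, hidden = toy_lm(prefix + [tok])
        nodes.append((tok, hidden, expand(prefix + [tok], depth - 1, branching, rng)))
    return nodes


def subtree_hiddens(node):
    """Collect the hidden states of a node and all of its descendants."""
    _, hidden, children = node
    out = [hidden]
    for child in children:
        out.extend(subtree_hiddens(child))
    return out


def route(tree, router_w):
    """Assumed router: linear score of mean-pooled subtree hidden states, then softmax."""
    scores = []
    for node in tree:
        hs = subtree_hiddens(node)
        pooled = [sum(h[i] for h in hs) / len(hs) for i in range(HIDDEN_DIM)]
        scores.append(sum(w * p for w, p in zip(router_w, pooled)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
router_w = [0.5, -0.2, 0.1, 0.3]  # hypothetical learned router weights
tree = expand([1, 2], depth=2, branching=3, rng=rng)
route_probs = route(tree, router_w)
choice = rng.choices(range(len(tree)), weights=route_probs)[0]
committed_token, _, kept_subtree = tree[choice]
```

The point of the sketch is that the choice among depth-1 candidates depends on hidden states computed *after* each candidate and its lookahead, not only on the root next-token distribution. Only the chosen depth-1 subtree (`kept_subtree`) survives to the next step.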
The resulting prune-shift-grow decoding process preserves discrete branch identities and defines a tractable tree-trajectory likelihood: newly grown nodes are counted when first sampled, and router decisions are assigned explicit probabilities. This enables end-to-end reinforcement learning with verifiable rewards, jointly optimizing the base model and router under the same likelihood-ratio principle as discrete-token RLVR.
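The likelihood bookkeeping described above can be illustrated with a small prune-shift-grow loop. The following are all assumptions for illustration: `toy_next_token_probs` stands in for the base model, the router is a hypothetical function of token ids rather than of hidden states, and `BRANCHING` and `DEPTH` are arbitrary. The sketch mirrors only the accounting the abstract states. Each node's sampling log-probability is added once, when the node is first grown, and each router choice adds the log of its explicit probability. The resulting sum can then feed a REINFORCE-style likelihood-ratio surrogate.

```python
import math
import random

VOCAB_SIZE = 6  # hypothetical vocabulary size
BRANCHING = 2   # hypothetical children per node
DEPTH = 2       # hypothetical lookahead depth


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for the base LM's next-token distribution."""
    rng = random.Random(hash(tuple(prefix)))
    logits = [rng.uniform(-1.0, 1.0) for _ in range(VOCAB_SIZE)]
    z = sum(math.exp(x) for x in logits)
    return [math.exp(x) / z for x in logits]


class Node:
    def __init__(self, token):
        self.token = token
        self.children = []


def grow(children, prefix, depth, rng, log_terms):
    """Restore the lookahead to `depth`; only newly sampled nodes add log-probs."""
    if depth == 0:
        return
    if not children:
        probs = toy_next_token_probs(prefix)
        for _ in range(BRANCHING):
            tok = rng.choices(range(VOCAB_SIZE), weights=probs)[0]
            log_terms.append(math.log(probs[tok]))  # counted once, when first sampled
            children.append(Node(tok))
    for child in children:
        grow(child.children, prefix + [child.token], depth - 1, rng, log_terms)


def subtree_tokens(node):
    out = [node.token]
    for child in node.children:
        out.extend(subtree_tokens(child))
    return out


def decode(steps, seed=0, router_w=0.3):
    """Prune-shift-grow loop that also accumulates the tree-trajectory log-likelihood."""
    rng = random.Random(seed)
    committed, frontier, log_terms = [], [], []
    for _ in range(steps):
        grow(frontier, committed, DEPTH, rng, log_terms)  # grow
        # Hypothetical router: logit = weight * mean token id of each depth-1 subtree.
        logits = [router_w * sum(t) / len(t) for t in map(subtree_tokens, frontier)]
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        total = sum(weights)
        probs = [w / total for w in weights]
        k = rng.choices(range(len(frontier)), weights=probs)[0]
        log_terms.append(math.log(probs[k]))  # router decision has an explicit probability
        chosen = frontier[k]                   # prune: drop sibling subtrees
        committed.append(chosen.token)         # shift: commit the chosen depth-1 token
        frontier = chosen.children             # its children become the new depth-1 nodes
    return committed, sum(log_terms), log_terms


def reinforce_surrogate(total_logp, reward, baseline=0.0):
    """Likelihood-ratio surrogate; under autodiff its gradient is (reward - baseline) * grad log p."""
    return (reward - baseline) * total_logp


tokens, logp, terms = decode(steps=3)
reward = 1.0  # hypothetical verifiable reward, e.g. final answer judged correct
loss = -reinforce_surrogate(logp, reward, baseline=0.5)
```

With `BRANCHING = 2` and `DEPTH = 2`, the first step grows 6 nodes and every later step grows 4 (one new layer under the two surviving children), plus one router term per step, so three steps contribute 7 + 5 + 5 = 17 log terms. In this sketch, nodes inherited through the shift are never re-counted, so the sum stays a single trajectory log-likelihood. In a real autodiff setting, that is the quantity a likelihood-ratio gradient would differentiate with respect to both the base model and the router.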
On synthetic hierarchical-planning tasks, LBR shows that post-candidate hidden states provide useful routing evidence. On mathematical reasoning benchmarks, LBR improves both Pass@1 and Pass@32 over discrete chain-of-thought, vanilla discrete-token RLVR, and RL-compatible soft-token branching baselines. These results suggest that lightweight local branching offers an efficient, trainable, and discrete form of language-model test-time scaling.
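The abstract reports Pass@1 and Pass@32 but does not say how they are estimated. A common choice, assumed here, is the unbiased pass@k estimator used in the HumanEval/Codex evaluation. Draw n ≥ k samples per problem, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k). The per-problem counts below are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts: (samples drawn, samples judged correct).
results = [(64, 8), (64, 0), (64, 64)]
print(mean_pass_at_k(results, 1))
print(mean_pass_at_k(results, 32))
```

Pass@1 reduces to the fraction of correct samples (c/n). Pass@32 credits a method for placing at least one correct solution among 32 tries, so the two metrics can move independently.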