UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

arXiv cs.AI·Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi, Yuqian Hou, Hong Xu, Zhongxiang Dai, Minchen Yu, Qingjiang Shi

6/1/2026


Quick Answer

UniScale introduces Unified Inference Scaling (UIS) to optimize model routing and test-time scaling for large language models, enhancing adaptability in dynamic environments.

Quick Take

UniScale treats the joint choice of model and test-time compute as a contextual multi-armed bandit problem, and its evaluation reports a finer-grained, consistently better quality-cost trade-off across diverse, dynamic inference scenarios.

Key Points

UniScale combines model routing and test-time scaling in a unified optimization framework.
The framework casts adaptive UIS as a contextual multi-armed bandit problem and learns inference policies online with LinUCB.
It addresses the coarse granularity of model routing and the capacity ceiling of single-model TTS by enabling fine-grained quality-cost adjustments.
Evaluation shows improved quality-cost trade-offs across various dynamic inference scenarios.
Efficiency-aware learning and cost modeling keep optimization stable and scalable over high-dimensional action spaces.
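
To illustrate why a unified space allows finer-grained adjustment than routing alone, here is a minimal sketch that counts the distinct cost levels reachable in each setting. The model names, unit costs, TTS budgets, and the assumption that cost grows linearly with the TTS budget are all hypothetical and not taken from the paper.

```python
from itertools import product

# Hypothetical relative cost of one generation per model (not from the paper).
MODEL_UNIT_COST = {"small": 1.0, "medium": 4.0, "large": 16.0}
# Hypothetical test-time-scaling budgets, e.g. number of sampled candidates.
TTS_BUDGETS = [1, 2, 8]


def routing_only_costs():
    """Cost levels reachable when only the model changes (budget fixed at 1)."""
    return sorted(set(MODEL_UNIT_COST.values()))


def unified_costs():
    """Cost levels reachable when model and TTS budget are chosen jointly.

    Assumes cost scales linearly with the TTS budget, which is a simplification.
    """
    return sorted(
        {MODEL_UNIT_COST[m] * n for m, n in product(MODEL_UNIT_COST, TTS_BUDGETS)}
    )


if __name__ == "__main__":
    print("routing only:", routing_only_costs())
    print("unified:     ", unified_costs())
```

Under these toy numbers, routing alone offers three cost levels, while the joint space fills in intermediate and higher levels that a policy can choose between.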



[Submitted on 29 May 2026]


Abstract: In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.
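
The bandit formulation in the abstract can be sketched roughly as follows. This is a generic disjoint LinUCB learner over (model, TTS budget) arms, not the authors' implementation: the model names, budgets, context features, the `cost_weight` penalty, and the reward shape `quality - cost_weight * cost` are illustrative assumptions, and the paper's efficiency-aware learning and cost modeling are not reproduced here.

```python
import math
from itertools import product

# Hypothetical model pool and TTS budgets; each pair is one arm in the unified space.
MODELS = ["small", "medium", "large"]
TTS_BUDGETS = [1, 4, 16]
ARMS = list(product(MODELS, TTS_BUDGETS))


class LinUCBArm:
    """Per-arm ridge-regression state for disjoint LinUCB."""

    def __init__(self, dim):
        # Inverse of A = I + sum(x x^T), kept up to date via Sherman-Morrison.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    def _a_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.a_inv]

    def ucb(self, x, alpha):
        theta = self._a_inv_times(self.b)  # ridge estimate A^{-1} b
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(xi * v for xi, v in zip(x, a_inv_x)), 0.0))
        return mean + alpha * width

    def update(self, x, reward):
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                # Rank-one downdate; valid because a_inv stays symmetric.
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


class UnifiedScalingBandit:
    """Picks a (model, TTS budget) arm for each request context."""

    def __init__(self, arms, dim, alpha=1.0):
        self.arms = arms
        self.alpha = alpha
        self.state = {arm: LinUCBArm(dim) for arm in arms}

    def select(self, x):
        return max(self.arms, key=lambda arm: self.state[arm].ucb(x, self.alpha))

    def update(self, arm, x, quality, cost, cost_weight=0.1):
        # Assumed reward: observed quality minus a linear cost penalty.
        self.state[arm].update(x, quality - cost_weight * cost)
```

In use, a caller would build a context vector `x` for each request, call `select(x)` to pick a model and budget, run inference, and then feed the observed quality and cost back through `update`. Keeping one small linear model per arm is the simplest variant; a shared-parameter LinUCB would be a natural alternative when the number of arms grows.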

Comments:	Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2605.30898 [cs.AI]
	(or arXiv:2605.30898v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.30898 arXiv-issued DOI via DataCite

Submission history

From: Kaiyu Huang
[v1] Fri, 29 May 2026 06:31:21 UTC (373 KB)

— Originally published at arxiv.org
