Scaling with Confidence: Calibrating Confidence of LLMs for Adaptive Test Time Scaling
Quick Answer
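The proposed C3RL algorithm improves the calibration of large language models (LLMs) by combining correctness, calibration, and dataset-informed reference accuracy rewards, outperforming the current state-of-the-art method in both performance and calibration metrics.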
Quick Take
C3RL trains LLMs to verbalize confidence that tracks their actual accuracy, evaluated across 8 text and multimodal datasets. The companion CAS strategy uses that confidence to allocate inference compute adaptively, surpassing majority voting on in-domain and out-of-domain datasets while reducing the inference budget by up to 12.33 times.
Key Points
- C3RL integrates correctness, calibration, and dataset-informed reference accuracy rewards for better LLM calibration.
- Outperforms state-of-the-art methods in performance and calibration metrics.
- CAS allocates computational resources based on response confidence.
- Achieves up to 12.33 times reduction in inference budget.
- Code, data, and models will be publicly released.
Article Content
From source RSS / original summary
arXiv:2607.01612v1 Announce Type: new
Abstract: Training large language models (LLMs) with reinforcement learning (RL) has significantly advanced their performance on reasoning and question-answering tasks. However, prevailing RL reward designs typically prioritize response correctness, neglecting to incentivize models to express their confidence accurately.
This leads to a critical problem: performance gains are often accompanied by poor calibration between confidence and accuracy, misleading models to overconfidently hallucinate when uncertain. To address this limitation, we propose Correctness and Confidence Calibration Reinforcement Learning (C3RL), a novel RL algorithm integrating correctness, calibration and dataset-informed reference accuracy rewards together.
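As a rough illustration of how a reward can favor both correct answers and honest confidence, the sketch below subtracts a Brier-style penalty from a correctness reward. This is an assumption made for illustration, not the C3RL reward itself: the summary does not give the paper's formula, and the dataset-informed reference accuracy term is left out because it is not defined here. The names `sketch_reward` and `calibration_weight` are hypothetical.

```python
def sketch_reward(is_correct: bool, confidence: float,
                  calibration_weight: float = 1.0) -> float:
    """Illustrative reward: correctness minus a Brier-style calibration penalty.

    NOT the C3RL formula (the source does not state it). Assumes the model
    verbalizes a confidence in [0, 1] alongside its answer.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    correctness = 1.0 if is_correct else 0.0
    # Squared gap between stated confidence and the actual outcome:
    # zero when the model is certain and right, or fully unsure and wrong.
    brier_penalty = (confidence - correctness) ** 2
    return correctness - calibration_weight * brier_penalty


if __name__ == "__main__":
    for correct, conf in [(True, 1.0), (True, 0.5), (False, 0.0), (False, 1.0)]:
        print(correct, conf, sketch_reward(correct, conf))
```

Under this sketch a confidently wrong answer scores lowest, so a policy cannot gain reward by overstating confidence when it is uncertain.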
Comprehensive evaluation across 8 text and multimodal datasets demonstrates that C3RL enhances calibration without sacrificing accuracy, outperforming the current state-of-the-art method in both performance and calibration metrics. Utilizing the well-calibrated verbalized confidence from C3RL, we further introduce Confidence-based Adaptive Test Time Scaling (CAS), an adjustable inference-time strategy that allocates computational resources based on response confidence.
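One way to picture confidence-based allocation is a loop that accepts a single response when its verbalized confidence is high and otherwise spends more samples and falls back to majority voting. The sketch below is a minimal version under that assumption; the summary does not describe the actual CAS procedure, and `adaptive_answer`, `threshold`, and `max_samples` are hypothetical names and parameters.

```python
from collections import Counter
from typing import Callable, Tuple

# A sampler returns one (answer, verbalized_confidence) pair per call.
Sampler = Callable[[], Tuple[str, float]]


def adaptive_answer(sample: Sampler, threshold: float = 0.8,
                    max_samples: int = 8) -> Tuple[str, int]:
    """Return (answer, samples_used).

    Illustrative only: stop after one sample if the model is confident,
    otherwise draw up to max_samples responses and take a majority vote.
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    answer, confidence = sample()
    if confidence >= threshold:
        return answer, 1  # confident: no extra compute spent
    votes = Counter([answer])
    for _ in range(max_samples - 1):
        extra_answer, _ = sample()
        votes[extra_answer] += 1
    return votes.most_common(1)[0][0], max_samples
```

With a well-calibrated model, most easy questions would exit after one sample, which is where a budget saving relative to always voting over `max_samples` responses would come from.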
Experiments show that CAS surpasses majority voting on both in-domain and out-of-domain datasets while reducing the inference budget by up to 12.33 times. We believe the synergy of C3RL and CAS paves the way for deploying more reliable and resource-efficient LLMs. The code, data and models will be released.
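The calibration between confidence and accuracy that the abstract refers to is commonly quantified with expected calibration error (ECE). The summary does not say which calibration metrics the paper reports, so the sketch below is offered only as a standard example of such a metric; the function name and the choice of 10 equal-width bins are assumptions.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that states 0.9 confidence on four answers but gets only three right has a gap of 0.15 in that bin; a lower ECE means stated confidence tracks accuracy more closely.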