Calibrating LLMs with Semantic-level Reward

arXiv cs.CL·Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

4d ago

·~2 min·5/18/2026·en·0

Quick Take

The CSR framework enhances LLM calibration by using semantic rewards instead of verbalized confidence scores.

Key Points

CSR combines correctness and semantic calibration rewards.
Reduces expected calibration error by up to 40%.
Improves AUROC by up to 31% over baselines.

📖 Reader Mode

~2 min read

[Submitted on 15 May 2026]

View PDF HTML (experimental)

Abstract:As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with same semantic meaning. We propose \textbf{Calibration with Semantic Reward (CSR)}, a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to $40\%$ and improving AUROC by up to $31\%$ over verbalized-confidence baselines, with calibration behavior generalizing robustly across all four evaluation settings.

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2605.15588 [cs.CL]
	(or arXiv:2605.15588v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.15588 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Ruijia Niu [view email]
[v1] Fri, 15 May 2026 03:55:11 UTC (2,364 KB)

— Originally published at arxiv.org

Continue reading on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

Calibrating LLMs with Semantic-level Reward

Quick Take

Key Points

📖 Reader Mode

Submission history

Want this in your inbox every morning?

More from arXiv cs.CL

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

Related in this space

From Prompts to Protocols: An AI Agent for Laboratory Automation

Agentic Trading: When LLM Agents Meet Financial Markets