Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration
Quick Take
The study shows that comparing token-probability scores with verbalized confidence in LLM calibration depends heavily on measurement choices, tested across four QA benchmarks. Under the default protocol, Instruct settings come out close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as gold answers, so the authors argue that both signals should be treated as protocol-dependent behavioral measurements.
Key Points
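- Evaluation covers four QA benchmarks and three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks.
- Conditioning context changes the sign or magnitude of the ECE gap across settings; token readout has a smaller effect, and the choice of ECE estimator has little.
- Verbalized confidence reflects answer plausibility and provenance, not correctness alone.
- Under the default generated-answer, bare-context protocol, Instruct settings are close to parity between the two confidence signals.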
- A reporting checklist is proposed for confidence measurement protocols.
Article Content
From source RSS / original summary: arXiv:2605.27752v1
Abstract: LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format.
We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks.
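To make the "token readout" axis concrete, here is a minimal sketch of how one answer's per-token log-probabilities could be collapsed into a single score in several ways. The abstract does not list the readouts the authors used, so the four variants and the name `readout_scores` are illustrative assumptions, and the log-probabilities are made up.

```python
import math


def readout_scores(token_logprobs):
    """Collapse per-token log-probabilities of one answer into scalar scores.

    The readout variants below are assumptions chosen for illustration;
    they are not taken from the paper.
    """
    if not token_logprobs:
        raise ValueError("answer must contain at least one token")
    total = sum(token_logprobs)
    return {
        # Probability of the whole answer string (product of token probabilities).
        "joint": math.exp(total),
        # Geometric mean of token probabilities, which removes the length penalty.
        "length_normalized": math.exp(total / len(token_logprobs)),
        # Probability of the first answer token only.
        "first_token": math.exp(token_logprobs[0]),
        # Probability of the least likely token in the answer.
        "min_token": math.exp(min(token_logprobs)),
    }


# Hypothetical log-probabilities for a three-token answer.
example_logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
scores = readout_scores(example_logprobs)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The same answer gets noticeably different scores depending on the readout, which is one reason a comparison against verbalized confidence can shift when only this choice changes.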
The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence.
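For readers unfamiliar with the metric, the following sketch computes expected calibration error (ECE) with equal-width bins and the gap between two confidence signals scored against the same correctness labels. The bin count, the equal-width binning, and the sign convention (verbalized minus token-probability) are assumptions made for this example; the abstract does not specify them, and the data are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |mean confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one example")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(is_correct for _, is_correct in items) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


def ece_gap(verbalized, token_prob, correct, n_bins=10):
    """Assumed convention: positive means verbalized confidence is worse calibrated."""
    return (expected_calibration_error(verbalized, correct, n_bins)
            - expected_calibration_error(token_prob, correct, n_bins))


# Invented example: four questions, three answered correctly.
labels = [1, 0, 1, 1]
verbalized_conf = [0.9, 0.9, 0.9, 0.9]
token_conf = [0.75, 0.75, 0.75, 0.75]
gap = ece_gap(verbalized_conf, token_conf, labels)
print(f"ECE gap (verbalized - token): {gap:.3f}")
```

Because the gap is a difference of two ECE values, a change in how the token score is read or conditioned can flip its sign even when the verbalized confidences are left untouched.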
In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.
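One way to apply the reporting checklist in practice is to record the four items alongside every calibration result. The sketch below is a hypothetical structure, not the authors' checklist format: the class name `ConfidenceProtocol`, its field names, and the `missing_fields` helper are assumptions; only the four checklist items and the "generated answer, bare context" default come from the abstract.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConfidenceProtocol:
    """Hypothetical record of the four checklist items named in the abstract."""
    elicitation_provenance: str   # how verbalized confidence was elicited
    scored_answer: str            # which answer string receives the token score
    token_readout: str            # how the score is read from the answer tokens
    conditioning_context: str     # context under which the score is measured

    def missing_fields(self):
        """Return the names of checklist items that were left blank."""
        return [name for name, value in asdict(self).items() if not value.strip()]


# Example report with one item deliberately left unspecified.
report = ConfidenceProtocol(
    elicitation_provenance="single prompt template, fixed probability scale and output format",
    scored_answer="generated answer",
    token_readout="",
    conditioning_context="bare context",
)
print(report.missing_fields())
```

A check like `missing_fields` could run before results are published, so that an unreported measurement choice is caught rather than silently defaulted.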