When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs
Quick Take
The study introduces ACE, an accuracy-controlled evaluation framework for fair comparison of LLMs, and shows that raw global calibration metrics often misrepresent how models compare on calibration. Many models favored by these metrics lose their advantage once accuracy is controlled for, which is why cross-model calibration comparisons need accuracy-aware evaluation.
Key Points
- ACE evaluates LLM calibration through Instance-Aligned, Distribution-Aligned, and Candidate-Aligned views.
- Raw global metrics often misrepresent model calibration, leading to frequent ranking reversals.
- The study analyzes small vs. large models and thinking vs. non-thinking models across benchmarks.
- Accuracy-aware evaluation is essential for fair comparisons of LLMs.
Article Content
From source RSS / original summary. arXiv:2606.30814v1. Announce Type: new. Abstract: Calibration evaluates whether a model's confidence aligns with its empirical accuracy. Existing studies often compare the calibration of different large language models using global calibration metrics such as Expected Calibration Error and Brier Score. We begin by showing, both theoretically and empirically, that such comparisons are confounded by differences in model accuracy.
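To make the confounding concrete, here is a minimal Python sketch of the two global metrics named above. The function names, the equal-width 10-bin variant of ECE, and the toy data are illustrative assumptions, not taken from the paper. Both toy models state a constant confidence equal to their own accuracy, so both are perfectly calibrated in the ECE sense, yet their Brier scores differ purely because their accuracies differ.

```python
import numpy as np

def brier_score(conf, correct):
    """Mean squared gap between confidence and the 0/1 correctness label."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))

def expected_calibration_error(conf, correct, n_bins=10):
    """Equal-width binned ECE: bin-weighted |accuracy - mean confidence|."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's share of samples
    return float(ece)

# Toy data (made up for illustration): constant confidence equal to accuracy.
strong_correct = np.array([1] * 9 + [0] * 1)  # accuracy 0.9
weak_correct = np.array([1] * 6 + [0] * 4)    # accuracy 0.6
strong_conf = np.full(10, 0.9)
weak_conf = np.full(10, 0.6)

ece_strong = expected_calibration_error(strong_conf, strong_correct)
ece_weak = expected_calibration_error(weak_conf, weak_correct)
brier_strong = brier_score(strong_conf, strong_correct)
brier_weak = brier_score(weak_conf, weak_correct)

print(f"ECE:   strong={ece_strong:.4f}  weak={ece_weak:.4f}")
print(f"Brier: strong={brier_strong:.4f}  weak={brier_weak:.4f}")
```

On this toy data both ECE values are zero up to floating-point error, while the Brier scores come out at roughly 0.09 and 0.24. That matches p(1 - p) for a constant confidence p equal to accuracy, so the gap reflects accuracy alone rather than any difference in calibration.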
For fairer cross-model comparison, we then propose ACE, an accuracy-controlled evaluation framework with three complementary views: Instance-Aligned, Distribution-Aligned, and Candidate-Aligned calibration. Across multiple benchmarks, model families, and confidence elicitation methods, we use ACE to study two practically important comparison axes, small versus large models and thinking versus non-thinking models.
We find that many previously reported calibration advantages under raw global metrics weaken substantially after accuracy control. We also find that ranking reversal is frequent: models favored by raw metrics often cease to be favored once accuracy is controlled. Our results show that raw global calibration metrics are not robust for cross-model comparison, and that fair calibration comparison requires accuracy-aware evaluation.
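The abstract does not spell out how ACE's three views are computed, so the following Python sketch is not ACE. It swaps in a much cruder stand-in, subsampling the more accurate model's correct answers until its accuracy matches the other model's, only to show how a ranking under a raw metric can flip once accuracy is equalized. The helper `subsample_to_accuracy`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def brier_score(conf, correct):
    """Mean squared gap between confidence and the 0/1 correctness label."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))

def subsample_to_accuracy(conf, correct, target_acc, seed=0):
    """Hypothetical helper (not from the paper): keep every wrong answer and a
    random subset of correct ones so that accuracy is close to target_acc."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=int)
    right = np.flatnonzero(correct == 1)
    wrong = np.flatnonzero(correct == 0)
    if len(wrong) == 0 or not 0.0 < target_acc < 1.0:
        return conf, correct  # nothing sensible to drop
    # Solve k / (k + n_wrong) = target_acc for the number k of correct answers kept.
    k = min(len(right), int(round(target_acc * len(wrong) / (1.0 - target_acc))))
    rng = np.random.default_rng(seed)
    keep = np.sort(np.concatenate([rng.choice(right, size=k, replace=False), wrong]))
    return conf[keep], correct[keep]

# Toy data (made up): model A is accurate but equally confident when wrong;
# model B is less accurate but lowers its confidence on its errors.
a_correct = np.array([1] * 18 + [0] * 2)       # accuracy 0.9
a_conf = np.full(20, 0.95)
b_correct = np.array([1] * 12 + [0] * 8)       # accuracy 0.6
b_conf = np.where(b_correct == 1, 0.7, 0.4)

raw_a = brier_score(a_conf, a_correct)
raw_b = brier_score(b_conf, b_correct)

# Bring A down to B's accuracy, then score A again on the matched subset.
a_conf_m, a_correct_m = subsample_to_accuracy(a_conf, a_correct, b_correct.mean())
matched_a = brier_score(a_conf_m, a_correct_m)

print(f"raw Brier:     A={raw_a:.4f}  B={raw_b:.4f}")
print(f"matched Brier: A={matched_a:.4f}  B={raw_b:.4f}")
```

In this toy setup A has the lower (better) raw Brier score, but after its accuracy is brought down to B's level its score is worse than B's. The subsampling is deliberately simplistic and discards data; it is meant only to illustrate the kind of reversal the abstract reports, not to reproduce the paper's procedure.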