Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
Quick Take
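The paper argues that, for binary MET/UNMET judgments, most of the agreement statistics commonly reported for LLM judges are redundant: Pearson's r, Spearman's rho, Kendall's tau_b, the phi coefficient, and the Matthews Correlation Coefficient all reduce to a single number. Cohen's kappa is the one agreement coefficient that adds information, because its gap from phi reflects how far the judge's positive-label rate has drifted from the human's. The paper also shows that the common ways of handling abstentions answer different questions, and it closes with a reporting checklist.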
Key Points
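- A survey of 24 recent LLM-as-judge papers finds that metric choice is entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and that these choices are rarely stated.
- On non-degenerate binary data, Pearson's r, Spearman's rho, Kendall's tau_b, the phi coefficient, and the Matthews Correlation Coefficient reduce to a single number, so reporting several of them adds no corroborating evidence.
- Cohen's kappa is the one agreement coefficient that adds information: its gap from phi measures how far the judge's positive-label rate has drifted from the human's.
- The three common ways of handling CANNOT_ASSESS abstentions answer different questions and break the binary equivalences.
- The reporting checklist names the judgment scale, the abstention and tie handling mode, coverage, the confusion matrix, and the aggregation level alongside any scalar agreement coefficient.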
Article Content
arXiv:2606.00093v1. Abstract: Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $\kappa$, and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated.
For binary criteria -- the common case in rubric-based evaluation, where each criterion is graded MET or UNMET -- most of the reported numbers are redundant: Pearson's $r$, Spearman's $\rho$, Kendall's $\tau_b$, the phi coefficient $\phi$, and the Matthews Correlation Coefficient all reduce to a single number on non-degenerate binary data, so reporting several of them only creates an illusion of corroborating evidence.
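The redundancy claim can be checked numerically. The sketch below is an illustration, not code from the paper: the ten MET/UNMET labels (coded 1/0) are made up, and every coefficient is computed from its textbook definition in plain Python. The phi coefficient and the Matthews Correlation Coefficient share one formula on a 2x2 table, so they are computed once.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r computed from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's r applied to tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau_b with the standard correction for ties."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            ties_x += dx == 0
            ties_y += dy == 0
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    pairs = n * (n - 1) // 2
    return (concordant - discordant) / sqrt((pairs - ties_x) * (pairs - ties_y))

def phi_mcc(x, y):
    """Phi coefficient, identical to MCC on a 2x2 confusion matrix."""
    tp = sum(a == 1 and b == 1 for a, b in zip(x, y))
    tn = sum(a == 0 and b == 0 for a, b in zip(x, y))
    fp = sum(a == 0 and b == 1 for a, b in zip(x, y))
    fn = sum(a == 1 and b == 0 for a, b in zip(x, y))
    return (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Hypothetical labels: 1 = MET, 0 = UNMET.
human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

results = {
    "pearson_r": pearson(human, judge),
    "spearman_rho": spearman(human, judge),
    "kendall_tau_b": kendall_tau_b(human, judge),
    "phi_mcc": phi_mcc(human, judge),
}
for name, value in results.items():
    print(f"{name}: {value:.6f}")
```

All four printed values coincide (about 0.408 for these labels), which is why listing them side by side adds no independent evidence. The functions assume non-degenerate input, meaning both label vectors contain at least one 0 and one 1; otherwise the denominators are zero.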
Cohen's $\kappa$ is the one agreement coefficient that adds information: it shares $\phi$'s numerator but normalizes differently, and the gap between them measures how far the judge's positive-label rate has drifted from the human's. We then trace what changes when a judge may abstain with a CANNOT_ASSESS verdict: the three common ways of handling abstentions are not interchangeable preprocessing choices but answer different questions, and they break the binary equivalences.
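The relationship between $\kappa$ and $\phi$ can be made concrete with standard 2x2-table algebra (this derivation is a sketch supplied here, not a quotation from the paper). Writing $p_h$ and $p_j$ for the human and judge positive-label rates and $p_{11}$ for the joint rate of MET/MET, both coefficients are built on $p_{11} - p_h p_j$: $\phi$ divides it by $\sqrt{p_h(1-p_h)\,p_j(1-p_j)}$, while $\kappa$ divides twice that quantity by $p_h(1-p_j) + p_j(1-p_h)$. The two agree exactly when $p_h = p_j$, and otherwise $|\kappa| < |\phi|$. The confusion-matrix counts below are hypothetical.

```python
from math import sqrt

def kappa_and_phi(tp, fp, fn, tn):
    """Return (Cohen's kappa, phi, judge-minus-human positive rate) for a 2x2 table.

    tp/fn are human-MET items the judge graded MET/UNMET;
    fp/tn are human-UNMET items the judge graded MET/UNMET.
    """
    n = tp + fp + fn + tn
    p_human = (tp + fn) / n          # human positive-label rate
    p_judge = (tp + fp) / n          # judge positive-label rate

    # Cohen's kappa from observed and chance agreement.
    p_o = (tp + tn) / n
    p_e = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    kappa = (p_o - p_e) / (1 - p_e)

    # Phi from the same covariance term; note p_o - p_e == 2 * cov.
    cov = tp / n - p_human * p_judge
    phi = cov / sqrt(p_human * (1 - p_human) * p_judge * (1 - p_judge))

    return kappa, phi, p_judge - p_human

# Hypothetical tables with 100 items each.
matched = kappa_and_phi(tp=40, fp=10, fn=10, tn=40)   # both rates are 0.5
drifted = kappa_and_phi(tp=45, fp=25, fn=5, tn=25)    # judge 0.7 vs human 0.5

for label, (kappa, phi, drift) in [("matched", matched), ("drifted", drifted)]:
    print(f"{label}: kappa={kappa:.3f} phi={phi:.3f} drift={drift:+.2f}")
```

In the matched table both coefficients equal 0.6. In the drifted table $\kappa$ is 0.400 while $\phi$ is about 0.436, so the gap opens only once the judge's positive-label rate moves away from the human's.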
The same equivalences reappear, up to a negligible finite-sample correction, for multi-judge ensembles scored with Fleiss' $\kappa$ or Krippendorff's $\alpha$. We close with a reporting checklist that names the judgment scale, the abstention and tie handling mode, coverage, the confusion matrix, and the aggregation level alongside any scalar agreement coefficient.
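One way to act on the checklist is to emit a small structured record next to any scalar coefficient. The sketch below is an assumption-laden illustration: the function name `agreement_report`, the dictionary keys, and the example labels are all hypothetical, and the abstract does not spell out the three abstention modes. The code assumes one plausible mode, excluding items where the judge returned CANNOT_ASSESS, and reports coverage so the exclusion is visible.

```python
ABSTAIN = "CANNOT_ASSESS"

def agreement_report(human, judge, aggregation="pooled over criteria"):
    """Build a checklist-style record for binary MET/UNMET judgments.

    Assumptions (not from the paper): human labels never abstain, and
    judge abstentions are excluded before computing agreement.
    """
    kept = [(h, j) for h, j in zip(human, judge) if j != ABSTAIN]
    coverage = len(kept) / len(human)

    confusion = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for h, j in kept:
        if h == "MET":
            confusion["tp" if j == "MET" else "fn"] += 1
        else:
            confusion["fp" if j == "MET" else "tn"] += 1

    n = len(kept)
    p_o = (confusion["tp"] + confusion["tn"]) / n
    p_human = (confusion["tp"] + confusion["fn"]) / n
    p_judge = (confusion["tp"] + confusion["fp"]) / n
    p_e = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    # Kappa is undefined when chance agreement is 1 (degenerate labels).
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")

    return {
        "judgment_scale": "binary (MET/UNMET)",
        "abstention_handling": "exclude judge abstentions",  # assumed mode
        "tie_handling": "not applicable (single binary verdict)",
        "coverage": coverage,
        "confusion_matrix": confusion,
        "aggregation_level": aggregation,
        "cohen_kappa": kappa,
    }

# Hypothetical labels for eight rubric criteria.
human = ["MET", "MET", "UNMET", "UNMET", "MET", "UNMET", "MET", "UNMET"]
judge = ["MET", "UNMET", "UNMET", ABSTAIN, "MET", "MET", ABSTAIN, "UNMET"]

report = agreement_report(human, judge)
for key, value in report.items():
    print(f"{key}: {value}")
```

Keeping coverage and the confusion matrix in the same record means a reader can recompute the coefficient under a different abstention mode, which a bare scalar does not allow.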