The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

·~2 min·6/15/2026·en·0

Quick Take

The study finds that LLM-as-a-Judge evaluation with GPT-4o-mini and GPT-4.1-mini is unreliable from run to run: pairwise preferences flip 13.6% of the time on average across repeated identical evaluations, and the two judges agree with each other only 76% of the time. The authors recommend multi-trial aggregation, position randomization, and explicit uncertainty reporting for high-stakes evaluations.

Key Points

GPT-4o-mini shows a significant first-position bias: 72% A-majority (p = 0.024).
Mean pointwise score gaps are small (0.19-0.36 on a 10-point scale) and not statistically significant in aggregate.
Semantically equivalent prompt templates change majority outcomes in 25% of tested cases.
On average, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability, rising to 15 for high-variance questions.
Single-trial evaluations are often too noisy for high-stakes decisions.

Article Content

From source RSS / original summary

arXiv:2606.13685v1 Announce Type: new Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations.

Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19--0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise--pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% ($\kappa = 0.51$), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency.
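
As a rough illustration of the two headline statistics, the Python sketch below computes a per-question flip rate and Cohen's kappa between two judges. The abstract does not spell out how a flip is counted, so `flip_rate` assumes one plausible definition: the share of repeated trials that disagree with the majority verdict. The function names and toy verdict lists are hypothetical and are not taken from the paper.

```python
from collections import Counter


def flip_rate(verdicts):
    """Share of trials that disagree with the majority verdict.

    Assumed definition for illustration; the paper may count flips differently.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1 - majority_count / len(verdicts)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data: 10 repeated pairwise trials for one question.
trials = ["A"] * 7 + ["B"] * 3
print(flip_rate(trials))

# Toy data: majority verdicts from two judges on four questions.
judge_1 = ["A", "A", "B", "B"]
judge_2 = ["A", "B", "B", "B"]
print(cohens_kappa(judge_1, judge_2))
```

Kappa discounts the agreement two judges would reach by chance given their own label frequencies, which is why 76% raw agreement can correspond to a kappa of only 0.51.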

A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice.
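
The recommended practice of multi-trial aggregation with position randomization can be sketched as follows. This is a minimal illustration, not the authors' code: `judge` is a hypothetical callable that returns "first" or "second", and `majority_match_probability` uses a simplified binomial model in which every trial independently matches the reference verdict with the same probability `p`. The paper's reliability curve is estimated from its own data, so this model is not expected to reproduce the reported 11-trial figure.

```python
import random
from collections import Counter
from math import comb


def judge_with_randomized_positions(judge, answer_1, answer_2, trials, rng):
    """Run repeated pairwise judgments, randomizing which answer is shown first.

    `judge(first, second)` is a hypothetical callable returning "first" or "second".
    Returns a Counter of wins keyed by "answer_1" / "answer_2".
    """
    wins = Counter()
    for _ in range(trials):
        swapped = rng.random() < 0.5
        first, second = (answer_2, answer_1) if swapped else (answer_1, answer_2)
        picked_first = judge(first, second) == "first"
        # Map the positional verdict back to the underlying answer.
        winner_is_1 = picked_first != swapped
        wins["answer_1" if winner_is_1 else "answer_2"] += 1
    return wins


def majority_match_probability(p, k):
    """P(majority of k independent trials matches the reference), for odd k."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(k // 2 + 1, k + 1))


def trials_needed(p, target=0.95, max_trials=99):
    """Smallest odd k whose majority vote reaches `target`, or None if not found."""
    for k in range(1, max_trials + 1, 2):
        if majority_match_probability(p, k) >= target:
            return k
    return None


# Toy judge that always prefers the longer answer, regardless of position.
prefers_longer = lambda first, second: "first" if len(first) >= len(second) else "second"
wins = judge_with_randomized_positions(
    prefers_longer, "a detailed answer", "ok", trials=21, rng=random.Random(0)
)
print(wins.most_common(1)[0][0])
print(trials_needed(0.8))
```

Restricting `k` to odd values avoids tied votes. A judge with pure position bias would split its wins roughly evenly under this scheme, so the bias surfaces as a near-tie rather than as a false preference.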

Because both judges are from a single provider, cross-provider replication remains an important next step.
