Auditing LLM Benchmarks with Item Response Theory
Quick Take
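A new Item Response Theory-based indicator surfaces likely mislabels in LLM benchmarks at 95% precision in the top 200 examples, outperforming a supervised classifier. The errors trace to mechanical labeling heuristics, annotation mistakes inherited from source datasets, and fundamentally ambiguous items. One frontier reward model agrees with the detected mislabels 78% of the time versus 38% for its peers, which is consistent with benchmark contamination or benchmark-specific over-optimization.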
Key Points
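- An Item Response Theory-based indicator surfaces likely mislabels at 95% precision in the top 200 examples, across seven preference and multiple-choice benchmarks and responses from 114 models.
- Errors are traced to mechanical labeling heuristics, upstream annotation mistakes inherited from source datasets, and ambiguous items with no defensible single label.
- One frontier reward model agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with contamination or benchmark-specific over-optimization.
- The indicator outperforms a supervised classifier.
- Benchmark labels are frozen at release, so label errors propagate silently into downstream benchmarks.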
Article Excerpt
arXiv:2605.30504v1. Abstract: LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier.
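The abstract does not spell out how the indicator is computed. As a rough illustration only, one way an IRT fit can surface suspect labels is through item discrimination: when an item's recorded label is wrong, stronger models tend to disagree with it more often than weaker ones, so its fitted discrimination goes negative. The sketch below fits a two-parameter logistic (2PL) model by plain gradient ascent on synthetic data and ranks items by discrimination. The 2PL choice, the function names (`fit_2pl`, `rank_suspects`, `precision_at_k`, `simulate_responses`), the hyperparameters, and the simulated responses are all assumptions made for this example, not the paper's method or data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_2pl(responses, n_steps=1000, lr=0.1, l2=0.01):
    """Fit P(agree) = sigmoid(a_j * (theta_i - b_j)) by gradient ascent.

    responses: (n_models, n_items) 0/1 matrix, 1 = model agrees with the
    recorded benchmark label. Hyperparameters are illustrative assumptions.
    """
    n_models, n_items = responses.shape
    # Initialise ability from standardised mean accuracy so the scale is
    # oriented with "higher theta = stronger model".
    acc = responses.mean(axis=1)
    theta = (acc - acc.mean()) / (acc.std() + 1e-8)
    a = np.ones(n_items)    # discrimination
    b = np.zeros(n_items)   # difficulty
    for _ in range(n_steps):
        diff = theta[:, None] - b
        resid = responses - sigmoid(a * diff)   # d(log-lik)/d(logit)
        grad_theta = (resid * a).sum(axis=1) - l2 * theta
        grad_a = (resid * diff).sum(axis=0) - l2 * a
        grad_b = (-resid * a).sum(axis=0) - l2 * b
        theta = theta + lr * grad_theta / n_items
        a = a + lr * grad_a / n_models
        b = b + lr * grad_b / n_models
    return theta, a, b

def rank_suspects(a):
    """Item indices ordered from most negative discrimination upward."""
    return np.argsort(a)

def precision_at_k(ranked, true_mislabels, k):
    """Fraction of the top-k ranked items that are truly mislabeled."""
    hits = sum(1 for j in ranked[:k] if int(j) in true_mislabels)
    return hits / k

def simulate_responses(n_models=114, n_items=300, n_flipped=20, seed=0):
    """Synthetic responses with a known set of flipped (mislabeled) items."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, n_models)
    a = rng.uniform(0.8, 2.0, n_items)
    b = rng.uniform(-1.0, 1.0, n_items)
    p = sigmoid(a * (theta[:, None] - b))
    responses = (rng.random((n_models, n_items)) < p).astype(float)
    flipped = rng.choice(n_items, size=n_flipped, replace=False)
    # A wrong label inverts what counts as "agreeing with the label".
    responses[:, flipped] = 1.0 - responses[:, flipped]
    return responses, set(flipped.tolist())

responses, flipped = simulate_responses()
theta_hat, a_hat, b_hat = fit_2pl(responses)
suspects = rank_suspects(a_hat)
print("precision@20 on synthetic data:", precision_at_k(suspects, flipped, k=20))
```

On real benchmarks the ground-truth set would come from manual review of the top-ranked items, which is presumably how a precision figure over the top 200 examples would be measured.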
We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.
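The contamination signal described above comes down to a per-model rate: on the items flagged as mislabeled, how often does each reward model still pick the recorded (likely wrong) label? A minimal sketch of that comparison follows, using a tiny hand-made response matrix. The function names, the median-plus-margin rule, and the `margin` value are hypothetical choices for illustration; the abstract does not say how the outlier model was identified.

```python
import numpy as np

def agreement_on_flagged(responses, flagged_items):
    """Per-model share of flagged items where the model matches the recorded label.

    responses: (n_models, n_items) 0/1 matrix, 1 = matches the recorded label.
    """
    responses = np.asarray(responses, dtype=float)
    return responses[:, list(flagged_items)].mean(axis=1)

def flag_outlier_models(rates, margin=0.25):
    """Indices of models whose rate exceeds the group median by more than margin."""
    rates = np.asarray(rates, dtype=float)
    return np.flatnonzero(rates - np.median(rates) > margin)

# Toy data (not from the paper): 4 models, 6 items, items 1, 3, 4, 5 flagged.
toy_responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
]
rates = agreement_on_flagged(toy_responses, [1, 3, 4, 5])
outliers = flag_outlier_models(rates)
print(rates, outliers)
```

A model that matches known-bad labels far more often than its peers has either seen those labels during training or been tuned heavily toward the benchmark, which is why the abstract treats the 78% versus 38% gap as evidence of one or the other.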