Auditing LLM Benchmarks with Item Response Theory · DeepSignal

arXiv cs.CL·Sander Land, Daniel M. Bikel

6/1/2026

Quick Take

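A new Item Response Theory-based indicator surfaces likely mislabels in LLM benchmarks at 95% precision in the top 200 examples, outperforming a supervised classifier. The errors stem from mechanical labeling heuristics, upstream annotation mistakes inherited from source datasets, and fundamentally ambiguous items. One frontier reward model agrees with the detected mislabels 78% of the time, versus 38% for its peers, which is consistent with benchmark contamination or benchmark-specific over-optimization.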

Key Points

Item Response Theory indicator surfaces likely mislabels at 95% precision in the top 200 examples.
Errors traced to mechanical labeling heuristics, inherited upstream annotation mistakes, and ambiguous items.
One frontier reward model agrees with detected mislabels at 78%, versus 38% for its peers.
The indicator outperforms a supervised classifier.
Label errors are frozen at release and propagate silently into downstream benchmarks.

Article Excerpt

From source RSS / original summary

arXiv:2605.30504v1. Abstract: LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier.
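The abstract does not say how the indicator is computed, so the following is only a rough illustration of the general idea, not the authors' method. It assumes a standard two-parameter logistic (2PL) IRT model fitted by plain gradient ascent, and it uses a negative fitted discrimination as a stand-in indicator: if stronger models disagree with an item's label more often than weaker ones, the label is suspect. The data is simulated, and every name here (`fit_2pl`, `is_flipped`, the parameter ranges) is hypothetical; only the 114-model count is taken from the abstract.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_2pl(responses, n_steps=1000, lr=0.1):
    """Fit P(agree) = sigmoid(a_j * theta_i + b_j) by joint gradient ascent.

    responses: (n_models, n_items) float array, 1.0 where a model's answer
    agrees with the benchmark label. Returns (theta, a, b).
    """
    n_models, n_items = responses.shape
    # Initialise ability from raw accuracy so "higher theta" means "stronger model".
    acc = responses.mean(axis=1)
    theta = (acc - acc.mean()) / (acc.std() + 1e-8)
    a = np.ones(n_items)   # discrimination per item
    b = np.zeros(n_items)  # intercept (easiness) per item
    for _ in range(n_steps):
        p = sigmoid(np.outer(theta, a) + b)
        err = responses - p  # gradient of the log-likelihood w.r.t. the logits
        grad_theta = (err @ a) / n_items
        grad_a = (theta @ err) / n_models
        grad_b = err.mean(axis=0)
        theta = theta + lr * grad_theta
        a = a + lr * grad_a
        b = b + lr * grad_b
        # Re-standardise abilities each step to pin down the scale of the model.
        theta = (theta - theta.mean()) / (theta.std() + 1e-8)
    return theta, a, b

# Simulated benchmark (assumption): 114 models, 300 items, 15 flipped labels.
rng = np.random.default_rng(0)
n_models, n_items, n_flipped = 114, 300, 15
true_theta = rng.normal(size=n_models)
true_a = rng.uniform(1.0, 2.0, size=n_items)
true_b = rng.normal(0.0, 0.8, size=n_items)
p_correct = sigmoid(np.outer(true_theta, true_a) + true_b)
correct = (rng.random((n_models, n_items)) < p_correct).astype(float)

is_flipped = np.zeros(n_items, dtype=bool)
is_flipped[rng.choice(n_items, size=n_flipped, replace=False)] = True
# On a mislabeled item, answering correctly means disagreeing with the label.
responses = np.where(is_flipped, 1.0 - correct, correct)

theta, a, b = fit_2pl(responses)
flagged = np.argsort(a)[:n_flipped]  # most negative discrimination first
precision_at_k = is_flipped[flagged].mean()
print(f"precision in top {n_flipped}: {precision_at_k:.2f}")
```

On real data the flagged items would still need manual review: a negative discrimination can also come from a genuinely ambiguous item, which the abstract lists as a separate error source.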

We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.
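The 78% versus 38% comparison can be read as a per-model agreement rate restricted to the flagged items. A minimal sketch of that arithmetic follows, using made-up toy data; the function name, the matrix, and the median-gap heuristic are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def agreement_with_suspect_labels(agrees_with_label, flagged):
    """Fraction of flagged (likely mislabeled) items on which each model
    still picks the benchmark's label.

    agrees_with_label: (n_models, n_items) bool array.
    flagged: integer indices of the items flagged as likely mislabels.
    """
    return agrees_with_label[:, flagged].mean(axis=1)

# Toy data (hypothetical): 4 reward models, 8 items, 4 of them flagged.
agrees = np.array([
    [1, 1, 1, 0, 1, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 1],
], dtype=bool)
flagged = np.array([1, 2, 3, 5])

rates = agreement_with_suspect_labels(agrees, flagged)
# A model that agrees with suspect labels far more often than its peers
# may have seen the benchmark labels or been tuned against them.
gap = rates - np.median(rates)
most_suspicious = int(np.argmax(gap))
print(rates, most_suspicious)
```

A high rate on its own is only suggestive. As the abstract puts it, the pattern is consistent with contamination or benchmark-specific over-optimization, not proof of either.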
