Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring
Quick Answer
This study reveals that count-based F1 scores can inflate without genuine improvements in error detection, highlighting a significant gap termed F1 Inflation.
Quick Take
Using ErrorBench, a controlled stress-test protocol, the authors find that anchored prompts produce up to 0.79 points of F1 Inflation under CoNLL-2014 M2-style scoring. They recommend that LLM evaluations avoid pre-populated error counts and report span-aware metrics alongside count-based ones.
Key Points
- Count-based F1 scores can inflate significantly without improving span localization.
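- ErrorBench evaluated six LLMs under five prompt conditions, revealing up to 0.79 points of F1 Inflation under CoNLL-2014 M2-style scoring.
- In a 100-passage replication, the Blind-to-Anchored prompt shift raised Count-F1 by +0.21 on average across six models, but raised multi-reference ERRANT F0.5 by only +0.04.
- Highly instruction-compliant GPT/Claude systems showed larger count responses under the stress test than the Gemini family.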
- LLM evaluations should report span-aware metrics alongside count-based metrics.
Article Content
From source RSS / original summary: arXiv:2607.01240v1. Abstract: Count-based F1 is widely used as a proxy for LLM error-detection quality, but this paper shows that it can rise dramatically without a corresponding improvement in span localization, a gap termed F1 Inflation. The paper introduces ErrorBench, a controlled stress-test protocol for prompt-induced count distortion. ErrorBench evaluates six contemporary LLMs under five prompt conditions over 4,290 responses from 143 CoNLL-2014 passages.
Under CoNLL-2014 M2-style scoring, anchored prompts produce up to 0.79 points of F1 Inflation, and up to 0.96 under strict matching. A 100-passage replication using the official ERRANT 3.0.0 pipeline and multi-reference scoring reproduces the pattern: averaged over six models, the Blind-to-Anchored prompt shift raises Count-F1 by +0.21 while raising multi-reference ERRANT F0.5 by only +0.04.
The study finds larger count responses in highly instruction-compliant GPT/Claude systems and smaller responses in the Gemini family under this stress-test protocol. The findings suggest that LLM proofreading and document-review evaluations should avoid pre-populated error counts and should report span-aware metrics alongside count-based metrics.
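To make the gap concrete, here is a minimal Python sketch of how a count-based score can be perfect while span-level agreement stays low. The summary does not give the paper's exact formulas, so everything here is an assumption for illustration: `count_f1` treats min(predicted count, gold count) as the true positives, `span_f` uses exact matching of hypothetical (start, end, correction) edits, and the example edits are invented toy data, not results from the paper.

```python
def f_beta(tp, n_pred, n_gold, beta=1.0):
    """Generic F-beta from true positives and set sizes."""
    if tp == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def count_f1(n_pred, n_gold):
    """Assumed count-based F1: only the NUMBER of reported errors matters."""
    return f_beta(min(n_pred, n_gold), n_pred, n_gold, beta=1.0)


def span_f(pred_edits, gold_edits, beta=0.5):
    """Assumed span-aware score: an edit counts only on an exact match."""
    return f_beta(len(pred_edits & gold_edits), len(pred_edits), len(gold_edits), beta)


# Toy data (hypothetical): edits are (start_token, end_token, correction).
gold = {(3, 4, "has"), (10, 11, "their"), (17, 18, "")}

# A response to an anchored prompt ("this passage contains 3 errors"):
# the count matches the anchor, but only one edit is in the right place.
anchored = {(3, 4, "has"), (5, 6, "very"), (12, 13, "of")}

count_score = count_f1(len(anchored), len(gold))
span_score = span_f(anchored, gold)
print(f"count-based F1 = {count_score:.2f}")
print(f"span-aware F0.5 = {span_score:.2f}")
print(f"gap = {count_score - span_score:.2f}")
```

In this toy case the count matches the gold count exactly, so the count-based score is 1.0, while only one of three edits is localized correctly. Reporting both numbers side by side is what exposes the difference.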