Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs
Quick Answer
The study finds that automatic attribution metrics for LLM retrieval-augmented generation do not transfer reliably across datasets: for instance, an off-the-shelf NLI scorer drops from AUROC 0.90 on short-claim AttributedQA to 0.53 (chance) on long-form LFQA.
Quick Take
Automatic attribution metrics for LLM retrieval-augmented generation do not transfer reliably across datasets: per-dataset metric rankings invert, and an NLI scorer that is best on AttributedQA (AUROC 0.90) falls to chance (0.53) on LFQA, where BERTScore wins (0.91). Metric choice must therefore be validated on the target dataset rather than inferred from other datasets. A prompt-based LLM judge avoids the chance-level collapses but is not uniformly best, costs roughly 100x more, and is non-deterministic.
Key Points
- Eight automatic scorers audited across three evaluation constructs show instability.
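- No scorer stays within the 95% confidence interval of the best audited scorer on every dataset of the generated-answer attribution construct.
- Per-dataset metric rankings invert between AttributedQA and LFQA (Kendall tau = -0.64, p = 0.031).
- A naive 'best-on-average' rule for choosing a scorer fails leave-one-dataset-out validation (mean held-out regret 0.172 AUROC), which is worse than fixing one scorer.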
- Prompt-based LLM judges are ~100x costlier and non-deterministic.
Article Content
From source RSS / original summary
arXiv:2606.23915v1 Announce Type: new
Abstract: Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable.
We audit eight automatic scorers -- lexical, embedding, and BERTScore baselines alongside entailment/grounding-trained models (clean and FEVER NLI, the checker MiniCheck) -- across three evaluation constructs (provenance/topicality, generated-answer attribution, and fact-check entailment), asking whether any scorer transfers: stays within the 95% confidence interval of the best audited scorer on every dataset of a multi-dataset construct.
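The transfer criterion above can be made concrete with a small sketch. This is one possible reading, not the paper's procedure: it assumes "within the 95% confidence interval of the best audited scorer" means a scorer's AUROC is no lower than the lower bound of a percentile-bootstrap interval around the best scorer's AUROC on that dataset. The function names (`auroc`, `bootstrap_ci`, `transfers`) and the input layout are hypothetical.

```python
import random

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for AUROC (assumed method; the source does not specify one)."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to bootstrap AUROC")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def transfers(scorer, datasets):
    """datasets: {dataset_name: (labels, {scorer_name: scores})} -- hypothetical layout.

    Returns True only if `scorer` reaches the lower CI bound of the best
    scorer on every dataset in the construct.
    """
    for labels, by_scorer in datasets.values():
        aucs = {name: auroc(labels, s) for name, s in by_scorer.items()}
        best = max(aucs, key=aucs.get)
        lo, _ = bootstrap_ci(labels, by_scorer[best])
        if aucs[scorer] < lo:
            return False
    return True
```

Under this reading a scorer fails as soon as it falls below the interval on a single dataset, which matches the "on every dataset" wording of the criterion.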
In the construct with the most multi-dataset human-labeled coverage -- generated-answer attribution (AttributionBench's four source datasets, n = 1,610, with independent HAGRID, n = 2,150) -- none does: the per-dataset metric rankings invert (Kendall tau = -0.64, p = 0.031 on AttributedQA vs. LFQA), and an off-the-shelf NLI scorer that is best on short-claim AttributedQA (AUROC 0.90) collapses to AUROC 0.53 (chance) on long-form LFQA, where BERTScore wins (0.91); the flip is not a length or truncation artifact.
This instability has a concrete decision cost: a naive "best-on-average" rule for choosing an evaluator fails leave-one-dataset-out (mean held-out regret 0.172 AUROC, worse than fixing one scorer), so metric choice must be validated on the target dataset rather than learned from others.
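The two quantities in this passage, rank agreement between datasets and leave-one-dataset-out regret, can be sketched as follows. This is an illustration under assumptions: `kendall_tau` is the tie-free tau-a, regret is taken as the held-out dataset's best AUROC minus the AUROC of the scorer chosen by mean AUROC on the remaining datasets, and the AUROC table is made-up toy data, not the paper's results. All names are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two score lists over the same scorers (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        concordant += s > 0
        discordant += s < 0
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

def lodo_regret(auroc_table):
    """auroc_table: {dataset: {scorer: auroc}}. Returns {held_out_dataset: regret}.

    For each held-out dataset, pick the scorer with the best mean AUROC on the
    other datasets ("best-on-average"), then measure how far it falls short of
    the held-out dataset's own best scorer.
    """
    regrets = {}
    for held in auroc_table:
        train = [d for d in auroc_table if d != held]
        scorers = list(auroc_table[held])
        mean_auc = {s: sum(auroc_table[d][s] for d in train) / len(train) for s in scorers}
        pick = max(mean_auc, key=mean_auc.get)
        regrets[held] = max(auroc_table[held].values()) - auroc_table[held][pick]
    return regrets

def fixed_regret(auroc_table, scorer):
    """Mean regret of always using one fixed scorer, for comparison."""
    gaps = [max(row.values()) - row[scorer] for row in auroc_table.values()]
    return sum(gaps) / len(gaps)

# Toy numbers for illustration only (NOT from the paper).
toy = {
    "d1": {"a": 0.9, "b": 0.6},
    "d2": {"a": 0.8, "b": 0.7},
    "d3": {"a": 0.5, "b": 0.9},
}
regrets = lodo_regret(toy)
mean_lodo = sum(regrets.values()) / len(regrets)
```

In this toy table the best-on-average rule picks the wrong scorer whenever the held-out dataset ranks scorers differently from the rest, so its mean regret exceeds that of fixing either scorer, which is the failure mode the abstract describes.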
A prompt-based LLM judge avoids the chance-level collapses the automatic scorers suffer (no LFQA collapse) but is not uniformly best, ~100x costlier, and non-deterministic -- relocating, not removing, the validation burden.