Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

arXiv cs.CL·Tianyu Ding, Aditya Nannapaneni, Juan Pablo De la Cruz Weinstein

6/24/2026 · ~2 min read

Quick Answer

This study finds that automatic attribution metrics for LLM retrieval-augmented generation (RAG) do not transfer reliably across datasets: an off-the-shelf NLI scorer falls from AUROC 0.90 on short-claim AttributedQA to 0.53 (chance) on long-form LFQA.

Quick Take

Per-dataset metric rankings invert, so an attribution metric that is best on one dataset can be near chance on another. Metric choice must therefore be validated on the target dataset rather than learned from other datasets. A prompt-based LLM judge avoids the chance-level collapses but is not uniformly best, is ~100x costlier, and is non-deterministic.

Key Points

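Eight automatic scorers are audited across three evaluation constructs: provenance/topicality, generated-answer attribution, and fact-check entailment.
On generated-answer attribution, no scorer stays within the 95% confidence interval of the best audited scorer on every dataset.
Per-dataset metric rankings invert: Kendall tau = -0.64 (p = 0.031) on AttributedQA vs. LFQA.
A naive 'best-on-average' rule fails leave-one-dataset-out (mean held-out regret 0.172 AUROC), which is worse than fixing one scorer.
A prompt-based LLM judge avoids chance-level collapses but is ~100x costlier and non-deterministic.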

Article Content

From source RSS / original summary

arXiv:2606.23915v1. Abstract: Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation (RAG) as interchangeable.

We audit eight automatic scorers -- lexical, embedding, and BERTScore baselines alongside entailment/grounding-trained models (clean and FEVER NLI, the checker MiniCheck) -- across three evaluation constructs (provenance/topicality, generated-answer attribution, and fact-check entailment), asking whether any scorer transfers: stays within the 95% confidence interval of the best audited scorer on every dataset of a multi-dataset construct.
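The transfer criterion above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes AUROC is computed per scorer and dataset, and it reads "stays within the 95% confidence interval of the best audited scorer" as "AUROC is at or above the lower bound of the best scorer's interval". The function names, the toy table, and the interval bounds are all hypothetical.

```python
from typing import Dict, List


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def transfers(
    auroc_table: Dict[str, Dict[str, float]],
    best_ci_lower: Dict[str, float],
) -> Dict[str, bool]:
    """Return, per scorer, whether it clears the bar on every dataset.

    auroc_table[dataset][scorer] is that scorer's AUROC on that dataset.
    best_ci_lower[dataset] is the assumed lower bound of the 95% CI of the
    best scorer on that dataset (how the CI is estimated is left open here).
    """
    scorers = set.intersection(*(set(row) for row in auroc_table.values()))
    return {
        s: all(auroc_table[d][s] >= best_ci_lower[d] for d in auroc_table)
        for s in sorted(scorers)
    }


# Made-up numbers for illustration only; they are not results from the paper.
TOY_AUROC = {
    "dataset_x": {"scorer_a": 0.90, "scorer_b": 0.70},
    "dataset_y": {"scorer_a": 0.55, "scorer_b": 0.88},
}
TOY_BEST_CI_LOWER = {"dataset_x": 0.86, "dataset_y": 0.84}
```

Calling `transfers(TOY_AUROC, TOY_BEST_CI_LOWER)` on this toy table marks both scorers as non-transferring, because each one wins on one dataset and falls below the bar on the other.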

In the construct with the most multi-dataset human-labeled coverage -- generated-answer attribution (AttributionBench's four source datasets, n = 1,610, with independent HAGRID, n = 2,150) -- none does: the per-dataset metric rankings invert (Kendall tau = -0.64, p = 0.031 on AttributedQA vs. LFQA), and an off-the-shelf NLI scorer that is best on short-claim AttributedQA (AUROC 0.90) collapses to AUROC 0.53 (chance) on long-form LFQA, where BERTScore wins (0.91); the flip is not a length or truncation artifact.

This instability has a concrete decision cost: a naive "best-on-average" rule for choosing an evaluator fails leave-one-dataset-out (mean held-out regret 0.172 AUROC, worse than fixing one scorer), so metric choice must be validated on the target dataset rather than learned from others.
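The ranking inversion and the leave-one-dataset-out regret can be illustrated with a small sketch. It rests on assumptions the abstract does not settle: regret is taken as the gap between the best AUROC on the held-out dataset and the AUROC of the chosen scorer, and rank agreement uses plain Kendall tau-a with no tie correction. The helper names and the toy AUROC table are hypothetical and are not the paper's numbers.

```python
from itertools import combinations
from typing import Dict

Table = Dict[str, Dict[str, float]]  # table[dataset][scorer] -> AUROC


def kendall_tau_a(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Kendall tau-a between two scorer->AUROC maps (tied pairs contribute 0)."""
    total, pairs = 0, 0
    for s, t in combinations(sorted(x), 2):
        prod = (x[s] - x[t]) * (y[s] - y[t])
        total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
        pairs += 1
    return total / pairs


def best_on_average_regret(table: Table) -> float:
    """Mean held-out regret of picking the scorer with the best mean AUROC elsewhere."""
    regrets = []
    for held_out, row in table.items():
        train = [d for d in table if d != held_out]
        pick = max(sorted(row), key=lambda s: sum(table[d][s] for d in train) / len(train))
        regrets.append(max(row.values()) - row[pick])
    return sum(regrets) / len(regrets)


def fixed_scorer_regret(table: Table, scorer: str) -> float:
    """Mean regret of always using one fixed scorer on every dataset."""
    return sum(max(row.values()) - row[scorer] for row in table.values()) / len(table)


# Made-up AUROC values for illustration only.
TOY: Table = {
    "dataset_1": {"scorer_a": 0.90, "scorer_b": 0.60, "scorer_c": 0.70},
    "dataset_2": {"scorer_a": 0.85, "scorer_b": 0.65, "scorer_c": 0.70},
    "dataset_3": {"scorer_a": 0.50, "scorer_b": 0.90, "scorer_c": 0.75},
}
```

In this toy table the rankings on `dataset_1` and `dataset_3` are fully inverted, and the best-on-average rule ends up with a higher mean regret than always using `scorer_c`, which matches the qualitative pattern the abstract reports.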

A prompt-based LLM judge avoids the chance-level collapses the automatic scorers suffer (no LFQA collapse) but is not uniformly best, ~100x costlier, and non-deterministic -- relocating, not removing, the validation burden.
