A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models
Quick Answer
This paper introduces a four-condition diagnostic protocol for evaluating evidence utilization in long-context and retrieval-augmented language models, and finds that the dominant failure mode depends on the task type.
Quick Take
The study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families on three benchmarks: Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU. Controlled synthetic settings primarily expose full-context utilization failures, while the tested realistic multi-hop settings primarily expose retrieval-chain coverage failures.
Key Points
- Proposes a four-condition protocol: no evidence, full context, retrieved evidence, oracle evidence.
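- Evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families, yielding 18,000 ONCU-compatible predictions.
- Finds that controlled synthetic settings primarily expose full-context utilization failures.
- Finds that the tested realistic multi-hop settings primarily expose retrieval-chain coverage failures.
- Aims to separate no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than to produce a single-score leaderboard.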
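The four conditions listed above can be pictured as four prompts built from the same example, with only the evidence block changing. The following is a minimal sketch, not the paper's implementation: the `Example` fields, the `Condition` names, the `build_prompt` helper, and the prompt template are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Condition(Enum):
    NO_EVIDENCE = "no_evidence"
    FULL_CONTEXT = "full_context"
    RETRIEVED = "retrieved"
    ORACLE = "oracle"


@dataclass(frozen=True)
class Example:
    # Hypothetical schema; the abstract does not describe the real one.
    question: str
    full_context: list[str]  # every passage in the long context
    retrieved: list[str]     # passages returned by a retriever
    oracle: list[str]        # gold supporting passages


def evidence_for(example: Example, condition: Condition) -> list[str]:
    """Return the passages shown to the model under a given condition."""
    if condition is Condition.NO_EVIDENCE:
        return []
    if condition is Condition.FULL_CONTEXT:
        return example.full_context
    if condition is Condition.RETRIEVED:
        return example.retrieved
    return example.oracle


def build_prompt(example: Example, condition: Condition) -> str:
    """Use one template for every condition; only the evidence differs."""
    passages = evidence_for(example, condition)
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Evidence:\n{evidence or '(none)'}\n\nQuestion: {example.question}\nAnswer:"


def matched_prompts(example: Example) -> dict[Condition, str]:
    """Build one prompt per condition for the same example (a matched group)."""
    return {c: build_prompt(example, c) for c in Condition}
```

Holding the question and template fixed across conditions is what lets a score difference be attributed to evidence availability rather than to prompt wording.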
Article Content
From source RSS / original summary (arXiv:2606.06758v1)
Abstract: Final-answer accuracy, retrieval recall, and citation overlap do not by themselves identify whether a long-context or retrieval-augmented language model used the evidence it was given. A model can answer from parametric memory, fail despite receiving the right passages, or cite evidence without converting it into the requested answer.
This paper proposes a matched four-condition evidence-availability protocol (no evidence, full context, retrieved evidence, and oracle-evidence reference) for diagnosing evidence utilization under fixed examples, prompts, score fields, retrieval settings, and validity checks. ONCU is used as a protocol-bound estimator of recovered oracle-reference evidence advantage and is computed only for denominator-valid groups; denominator-free answer, evidence, retrieval, and failure-audit metrics are reported separately.
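The abstract does not state the ONCU formula. One plausible reading of "recovered oracle-reference evidence advantage", assumed here purely for illustration, is the fraction of the oracle-over-no-evidence gain that a given condition recovers, defined only when that gain is positive. The function names, the ratio itself, and the validity rule below are assumptions of this sketch, not the paper's definition.

```python
from typing import Optional


def recovered_advantage(no_evidence: float, condition: float, oracle: float) -> Optional[float]:
    """Assumed ratio: (condition - no_evidence) / (oracle - no_evidence).

    Returns None when the denominator is not positive, i.e. when oracle
    evidence did not improve on the no-evidence score for this group.
    """
    denominator = oracle - no_evidence
    if denominator <= 0:
        return None  # not denominator-valid under this assumed rule
    return (condition - no_evidence) / denominator


def mean_over_valid(groups: list[tuple[float, float, float]]) -> Optional[float]:
    """Average the assumed ratio over denominator-valid groups only.

    Each group is a (no_evidence, condition, oracle) score triple.
    """
    values = [recovered_advantage(n, c, o) for n, c, o in groups]
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None
```

Under this reading, a group where the model already answers correctly without evidence contributes nothing to the ratio, which fits the abstract's choice to report denominator-free metrics separately.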
The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families across Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, with 18,000 ONCU-compatible predictions.
The main finding is a task-dependent bottleneck split: controlled synthetic settings primarily expose full-context utilization failures, whereas the tested realistic multi-hop settings primarily expose retrieval-chain coverage failures in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups.
The contribution is a diagnostic protocol for separating no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than a single-score leaderboard for long-context or retrieval-augmented systems.