Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection
Quick Answer
The Evidence Graph Consistency (EGC) framework computes structural consistency features over a per-response evidence graph in Retrieval-Augmented Generation (RAG) and finds a model-family split: the features point in the expected diagnostic direction for hallucinations in Llama-2 models but reverse in GPT-4, GPT-3.5, and Mistral-7B.
Quick Take
The Evidence Graph Consistency (EGC) framework looks for hallucination patterns in Retrieval-Augmented Generation (RAG) responses and reveals a model-family split: graph consistency features behave as expected for hallucinations in Llama-2 models, while GPT-4, GPT-3.5, and Mistral-7B show a systematic reversal. This points to qualitatively different hallucination patterns across model families and suggests that embedding-based graph consistency cannot serve as a universal detection method.
Key Points
- EGC computes five structural consistency measures as hallucination indicators.
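- Evaluated on the full question answering split of RAGTruth: 5,767 responses across six LLMs.
- In Llama-2 models, graph consistency features point in the expected diagnostic direction for hallucinations.
- In GPT-4, GPT-3.5, and Mistral-7B the direction systematically reverses, suggesting qualitatively different hallucination patterns.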
- Embedding-based graph consistency is not a model-independent detection signal.
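One way to make the "expected direction versus reversal" idea concrete is to compute, per model, the AUROC of a consistency feature against hallucination labels. The sketch below is not the paper's evaluation code: the function names, the `(model, consistency, is_hallucinated)` record layout, and the toy records are all hypothetical, and the summary does not say which statistic the authors used. It assumes that low consistency is the expected sign of a hallucination, so an AUROC above 0.5 on the negated score means "expected" and below 0.5 means "reversed".

```python
from collections import defaultdict


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def direction_by_model(records):
    """records: iterable of (model, consistency, is_hallucinated) tuples (hypothetical layout)."""
    grouped = defaultdict(lambda: ([], []))
    for model, consistency, is_hallucinated in records:
        scores, labels = grouped[model]
        scores.append(-consistency)  # negate so that LOW consistency ranks as more suspicious
        labels.append(int(is_hallucinated))
    out = {}
    for model, (scores, labels) in grouped.items():
        a = auroc(scores, labels)
        verdict = "expected" if a > 0.5 else "reversed" if a < 0.5 else "uninformative"
        out[model] = (a, verdict)
    return out


# Synthetic toy records, NOT data from RAGTruth or the paper.
toy_records = [
    ("model_a", 0.9, False), ("model_a", 0.8, False),
    ("model_a", 0.3, True), ("model_a", 0.2, True),
    ("model_b", 0.3, False), ("model_b", 0.2, False),
    ("model_b", 0.9, True), ("model_b", 0.8, True),
]
result = direction_by_model(toy_records)
```

In this toy setup, `model_a` hallucinations have lower consistency and `model_b` hallucinations have higher consistency, so a single threshold on the feature cannot work for both. That is the practical meaning of a signal that is not model-independent.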
Article Excerpt
arXiv:2606.06748v1. Abstract: Retrieval-Augmented Generation (RAG) reduces but does not eliminate hallucination in large language models. Existing detection methods rely on flat similarity between generated answers and retrieved passages, ignoring structural relationships among evidence pieces and answer claims. We propose Evidence Graph Consistency (EGC), a framework that constructs a local evidence graph per response and computes five structural consistency measures as hallucination indicators.
Evaluated on the full question answering split of RAGTruth across six LLMs (5,767 responses), EGC reveals a consistent model-family split: graph consistency features show the expected diagnostic direction for hallucinations in Llama-2 models but exhibit systematic reversal in GPT-4, GPT-3.5, and Mistral-7B.
This reversal suggests qualitatively different hallucination patterns across model families and indicates that embedding-based graph consistency cannot serve as a model-independent hallucination detection signal.
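The abstract does not spell out how the evidence graph is built or what the five measures are, so the following is only a rough sketch of the general idea under stated assumptions: nodes are retrieved passages and answer claims, each with an embedding; an edge links two nodes when their cosine similarity reaches a threshold; and structural features are read off the resulting graph. The function names, the 0.5 threshold, the three features, and the 2-D toy vectors are all hypothetical stand-ins, not the paper's definitions. In practice the vectors would come from a sentence encoder.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(evidence, claims, threshold=0.5):
    """Return (nodes, edges); edges maps an index pair to its similarity weight."""
    nodes = [("evidence", vec) for vec in evidence] + [("claim", vec) for vec in claims]
    edges = {}
    for a, b in combinations(range(len(nodes)), 2):
        weight = cosine(nodes[a][1], nodes[b][1])
        if weight >= threshold:  # hypothetical cutoff, not taken from the paper
            edges[(a, b)] = weight
    return nodes, edges


def consistency_features(nodes, edges):
    """Three illustrative structural features (stand-ins for the paper's five measures)."""
    claim_ids = [i for i, (kind, _) in enumerate(nodes) if kind == "claim"]
    # Keep only edges that connect a claim to an evidence passage.
    cross = {e: w for e, w in edges.items() if nodes[e[0]][0] != nodes[e[1]][0]}
    supported = {i for e in cross for i in e if nodes[i][0] == "claim"}
    possible = len(nodes) * (len(nodes) - 1) / 2
    return {
        "claim_coverage": len(supported) / len(claim_ids) if claim_ids else 0.0,
        "mean_support_weight": sum(cross.values()) / len(cross) if cross else 0.0,
        "density": len(edges) / possible if possible else 0.0,
    }


# Toy 2-D embeddings: one claim matches a passage, the other contradicts it.
evidence = [[1.0, 0.0], [0.0, 1.0]]
claims = [[1.0, 0.0], [-1.0, 0.0]]
nodes, edges = build_graph(evidence, claims)
features = consistency_features(nodes, edges)
```

Unlike a flat answer-to-passage similarity score, features such as claim coverage depend on which claims are connected to which passages, which is the kind of structural relationship the abstract says existing methods ignore.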