X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

arXiv cs.CL·Yongqi Kang, Yu Fu, Yong Zhao

1d ago

·~2 min·6/12/2026·en·0

Quick Answer

This paper shows that X-MADAM-RAG effectively diagnoses and manages evidence conflicts in multilingual retrieval-augmented generation systems, achieving 0.9667 strict accuracy on the X-RAMDocs-ZHEN benchmark.

Quick Take

X-MADAM-RAG reaches 0.9667 strict accuracy on the X-RAMDocs-ZHEN benchmark and outperforms an evidence-normalized single-call baseline. On a naturalized stress test it drops to 0.3000, below both the naive and evidence-normalized baselines, which points to document-level extraction as the main bottleneck.

Key Points

The X-RAMDocs-ZHEN benchmark includes 300 examples across six balanced conditions.
X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success with Qwen2.5-7B-Instruct.
A zero-call rule-only extractor scores 1.0000 on the same benchmark, revealing strong template regularity.
On the 100-sample naturalized stress test, X-MADAM-RAG drops to 0.3000 strict accuracy and rule-only extraction falls to 0.0000, exposing extraction limitations.
The authors position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict, not as evidence of general hallucination detection or robustness to natural retrieval.
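To make the reported scores concrete, the sketch below converts them back into example counts. It assumes each score is a plain unweighted fraction over the full set (300 examples on the original benchmark, 100 on the stress-test subset); the summary does not state this, and the helper name implied_correct is purely illustrative.

```python
def implied_correct(score: float, n_examples: int) -> int:
    """Number of correct examples implied by a rounded score.

    Assumes the score is an unweighted fraction correct / n_examples.
    """
    return round(score * n_examples)


# Six balanced conditions over 300 examples give 50 examples per condition.
per_condition = 300 // 6

# Counts implied by the reported scores (an inference, not reported figures).
strict_ok = implied_correct(0.9667, 300)
conflict_ok = implied_correct(0.9767, 300)
stress_ok = implied_correct(0.3000, 100)

print(per_condition, strict_ok, conflict_ok, stress_ok)
```

Under that assumption, the gap between strict accuracy and conflict-aware success on the original benchmark amounts to only a handful of examples, while the stress-test drop is far larger.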


Article Content

From source RSS / original summary

arXiv:2606.12903v1 Announce Type: new

Abstract: Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise.

We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity.

To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.
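The last two pipeline stages named in the abstract, deterministic candidate grouping and conflict-aware aggregation, can be illustrated with a minimal sketch. This is not the authors' implementation: the Extraction record, the normalization rule (NFKC, casefold, whitespace collapse), and the status labels are all assumptions made for illustration, and the per-document extraction step is taken as already done.

```python
import unicodedata
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    """Hypothetical per-document extraction result."""
    doc_id: str
    lang: str                 # e.g. "zh" or "en"
    candidate: Optional[str]  # None when the document yields no answer


def normalize(candidate: str) -> str:
    """Deterministic grouping key: NFKC-normalize, casefold, collapse whitespace."""
    text = unicodedata.normalize("NFKC", candidate).casefold()
    return " ".join(text.split())


def group_candidates(extractions):
    """Group extractions by normalized candidate string, skipping empty ones."""
    groups = defaultdict(list)
    for ex in extractions:
        if ex.candidate is not None and ex.candidate.strip():
            groups[normalize(ex.candidate)].append(ex)
    return dict(groups)


def aggregate(extractions):
    """Flag a conflict when more than one candidate group survives grouping."""
    groups = group_candidates(extractions)
    if not groups:
        return {"status": "no_answer", "candidates": []}
    # Rank by number of supporting documents, then by key for stable output.
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    status = "conflict" if len(ranked) > 1 else "supported"
    return {
        "status": status,
        "candidates": [
            {
                "answer": key,
                "docs": [e.doc_id for e in members],
                "langs": sorted({e.lang for e in members}),
            }
            for key, members in ranked
        ],
    }


example = [
    Extraction("d1", "zh", "Paris"),
    Extraction("d2", "en", " paris "),
    Extraction("d3", "en", "Lyon"),
    Extraction("d4", "en", None),
]
result = aggregate(example)
print(result["status"], [c["answer"] for c in result["candidates"]])
```

Because grouping here depends entirely on the extracted candidate strings, any error in per-document extraction propagates directly into the aggregation step, which is consistent with the abstract's finding that extraction is the main bottleneck.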
