When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

arXiv cs.CL·Nabaraj Subedi, Ahmed Abdelaty, Shivanand Venkanna Sheshappanavar

6/11/2026

·~1 min·6/11/2026·en·1

Quick Answer

The study introduces MASDR-RAG, addressing vector search dilution in retrieval-augmented generation (RAG) by using domain-scoped metadata, improving P@10 from 0.77 to 0.86 across various LLMs and datasets.

Quick Take

This method mitigates accuracy loss when scaling document collections, as demonstrated in a Wyoming DOT corpus, where accuracy dropped from 75% to below 40% when increasing documents from 54 to 1,128. The findings suggest prioritizing domain scoping before synthesis calls.

Key Points

MASDR- improves retrieval accuracy in large document collections by using domain-scoped metadata.
Accuracy dropped from 75% to below 40% when scaling from 54 to 1,128 documents.
P@10 improved significantly from 0.77 to 0.86 with domain scoping ($p < 0.05$).
orchestration revealed a precision-faithfulness paradox due to configuration dependence.
Code and data will be publicly available upon acceptance of the paper.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Source Excerpt

arXiv:2606. 11350v1 Announce Type: new Abstract: degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power, and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution.

Even when using hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. …

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Isabel Xu (The Overlake School), Cynthia Xu (The Overlake School), Rachel Ren (Edwards Vacuum Inc.), Cong Guo (The University of Memphis), Jiacheng Ding (The University of Memphis)

5d ago

FeaturedOriginal

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

AI Summary

TriAgent introduces a cost-efficient multi-agent system for financial sentiment analysis, combining VADER, FinBERT, and Qwen2.5. It achieves an F1 score of ~0.87 with significant savings of $9.3M/year at a 10M-user scale compared to GPT-4o-mini, while also detecting hallucinations with an AUC of 0.90.

#LLM #Agent #AI Startup #Enterprise AI

When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Multi-Agent Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis