Structure-Aware RAG: Structured Retrieval Augmented Generation from Noisy Data for Conversational Agents
Quick Take
Structure-aware Retrieval Augmented Generation (SA-RAG) uses tables as an intermediate structured representation for conversational agents, reducing noise in retrieved context while preserving essential information. It introduces a quality-aware table metadata generation framework that improves both metadata quality and downstream performance, and it outperforms existing RAG baselines on two noisy real-world datasets.
Key Points
- SA-RAG uses tables for structured representation, reducing noise in data retrieval.
- Introduces a quality-aware table metadata generation framework that models metadata normalization and effectiveness.
- Outperforms existing RAG methods on two noisy real-world datasets.
- Explores both training-free and training-based table generation techniques.
- Code is publicly available for further research and application.
Article Content
arXiv:2605.24366v1
Abstract: Large Language Models (LLMs) have been widely adopted in conversational applications. However, their reliance on parametric knowledge limits reliability in real-world scenarios that require dynamic or domain-specific information. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge during generation, but existing text-based and graph-based RAG methods often struggle with noisy or irrelevant contexts.
In this work, we propose Structure-aware Retrieval Augmented Generation (SA-RAG), which uses tables as an intermediate structured representation to provide a compact and controllable interface that reduces noise while preserving essential information. We introduce a quality-aware table metadata generation framework that models metadata normalization and effectiveness, improving metadata quality and downstream performance. Furthermore, we explore both training-free and training-based table generation methods.
Generation validation and direct preference optimization further improve table quality while maintaining semantic and structural consistency. Experiments on two noisy real-world datasets show that SA-RAG significantly outperforms existing RAG baselines. Our code is publicly available.
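To make the idea of a table as an intermediate representation concrete, here is a minimal Python sketch. It is not the SA-RAG implementation: the `Table` class, the helper names, the token-overlap relevance test, and the sample records are all hypothetical, and the abstract describes LLM-based table generation rather than the simple field filtering used here. The sketch only shows how fixing a column schema can drop noisy fields and empty records before the context reaches the generator.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical intermediate table: a column schema plus rows."""
    columns: list[str]
    rows: list[dict[str, str]]


def build_table(records: list[dict[str, str]], columns: list[str]) -> Table:
    """Keep only schema columns and drop records that are empty in all of them.

    Assumption: a stand-in for table generation, which the paper does with
    training-free or training-based methods not reproduced here.
    """
    rows = []
    for record in records:
        row = {c: record.get(c, "").strip() for c in columns}
        if any(row.values()):
            rows.append(row)
    return Table(columns=columns, rows=rows)


def select_rows(table: Table, query: str) -> Table:
    """Keep rows sharing at least one token with the query (toy relevance test)."""
    query_tokens = set(query.lower().split())
    kept = [
        row for row in table.rows
        if query_tokens & set(" ".join(row.values()).lower().split())
    ]
    return Table(columns=table.columns, rows=kept)


def to_prompt(table: Table) -> str:
    """Serialize the table as pipe-delimited text for the generator prompt."""
    header = " | ".join(table.columns)
    lines = [" | ".join(row[c] for c in table.columns) for row in table.rows]
    return "\n".join([header, *lines])


# Made-up noisy records: an advertising field and an empty record act as noise.
records = [
    {"product": "router x1", "issue": "wifi drops", "fix": "update firmware", "ad": "buy now"},
    {"product": "", "issue": "", "fix": "", "ad": "subscribe today"},
    {"product": "modem m2", "issue": "no signal", "fix": "check cable"},
]
table = build_table(records, ["product", "issue", "fix"])
context = to_prompt(select_rows(table, "router wifi problem"))
print(context)
```

In this toy run the `ad` field never enters the table because it is outside the schema, the empty record is dropped, and only the row that overlaps with the query is serialized into the context.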