MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A
Quick Take
MM-BizRAG improves multimodal retrieval-augmented generation by explicitly extracting document structure, outperforming state-of-the-art vision-centric baselines by up to 32 percentage points across an enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V). Its design yields richer, more grounded answers without fine-tuning, and its FastRAGEval metric halves RAGChecker's evaluation cost while aligning more closely with human judgment.
Key Points
- MM-BizRAG uses document structure-aware processing for improved retrieval and generation.
- Outperforms state-of-the-art vision-centric baselines by up to 32 percentage points.
- Shows especially strong gains on report-style layouts.
- FastRAGEval, a single-call LLM judge metric for generative recall, halves RAGChecker's cost while achieving stronger human alignment.
- Uses a unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment to preserve natural reading order.
Article Content
From source RSS / original summary
arXiv:2606.04231v1 Announce Type: new
Abstract: Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure.
In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks).
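The routing idea can be sketched in a few lines of Python. This is only an illustration: the abstract does not say how orientation is detected, so the majority vote over page aspect ratios, the `Page` fields, and the pipeline names below are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    width: float   # page width, e.g. in points (hypothetical field)
    height: float  # page height, e.g. in points (hypothetical field)


def document_orientation(pages: List[Page]) -> str:
    """Return 'horizontal' if most pages are landscape, else 'vertical'.

    Assumption: a simple majority vote over page aspect ratios. The
    split criterion actually used by MM-BizRAG is not given in the abstract.
    """
    if not pages:
        raise ValueError("document has no pages")
    landscape = sum(1 for p in pages if p.width > p.height)
    return "horizontal" if landscape * 2 > len(pages) else "vertical"


def route(pages: List[Page]) -> str:
    """Pick an ingestion pipeline for the document (pipeline names are hypothetical)."""
    if document_orientation(pages) == "vertical":
        return "layout_aware_parsing"       # vertically structured, e.g. reports
    return "page_level_representation"      # horizontally structured, e.g. slide decks
```

Under these assumptions, a three-page US Letter report (`Page(612, 792)`) would go to layout-aware parsing, while a 16:9 slide deck (`Page(960, 540)`) would be kept as whole-page representations.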
A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without any finetuning requirement.
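A minimal sketch of placeholder-based alignment and inference-time assembly follows. It assumes that parsed text carries a token such as `[[ARTIFACT:0]]` where a table or figure was lifted out, that retrieval indexes an LLM-written description of the artifact, and that generation receives the original image in the same position. The token format, function names, and file names are hypothetical and are not taken from the paper.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical placeholder format marking where an artifact sat in reading order.
PLACEHOLDER = re.compile(r"\[\[ARTIFACT:(\d+)\]\]")


def retrieval_text(text: str, descriptions: Dict[int, str]) -> str:
    """Build the retrieval representation: each placeholder becomes its text description."""
    return PLACEHOLDER.sub(lambda m: descriptions[int(m.group(1))], text)


def generation_context(text: str, images: Dict[int, str]) -> List[Tuple[str, str]]:
    """Build the generation context: ordered ('text' | 'image', value) parts.

    Text is split at each placeholder and the matching image reference is
    inserted at that position, so reading order is preserved.
    """
    parts: List[Tuple[str, str]] = []
    cursor = 0
    for m in PLACEHOLDER.finditer(text):
        before = text[cursor:m.start()].strip()
        if before:
            parts.append(("text", before))
        parts.append(("image", images[int(m.group(1))]))
        cursor = m.end()
    tail = text[cursor:].strip()
    if tail:
        parts.append(("text", tail))
    return parts
```

Keeping two builders over the same placeholder text is one way to decouple what the retriever indexes from what the generator sees: the retriever works on pure text, while the generator gets interleaved text and images.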
Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker's cost while achieving stronger human alignment.