FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing
Quick Take
FAB-Bench introduces an adaptive benchmarking framework for Retrieval-Augmented Generation (RAG) in semiconductor manufacturing, defining six diagnostic metrics and revealing distinct context-scaling behaviors across four LLMs. The framework's evaluation highlights attention dilution as a key factor in performance degradation at extreme context lengths.
Key Points
- FAB-Bench defines six metrics: factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, reasoning consistency.
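- Evaluates four LLMs and four RAG frameworks, revealing three context-scaling behaviors: logarithmic growth, early saturation, and cold-start dynamics.
- Curates a benchmark of 200 query-answer pairs from over 1,300 generated candidates, spanning needle-in-haystack, intra-document multi-topic, and cross-document multi-hop synthesis.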
- Identifies attention dilution as a primary cause of performance degradation at extreme context lengths.
- Cross-framework validation confirms evaluation portability across three additional production RAG systems.
Article Content
From source RSS / original summary
arXiv:2605.26476v1. Announce Type: new.
Abstract: Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalable. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing.
FAB-Bench defines six diagnostic metrics measuring factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K-32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as contextual scope expands.
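As a rough illustration of how the six metrics might be recorded per context window, here is a minimal Python sketch. The class name `FabBenchScores`, the [0, 1] score scale, the unweighted mean, and the intermediate window sizes are all assumptions made for this example; the abstract does not specify the scoring scale, any aggregation rule, or which window sizes between 4K and 32K were tested.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class FabBenchScores:
    """Hypothetical container for the six metrics named in the abstract.

    The [0, 1] scale is an assumption, not something the source states.
    """
    factual_accuracy: float
    contextual_utilization: float
    completeness: float
    retrieval_relevance: float
    technical_depth: float
    reasoning_consistency: float

    def __post_init__(self):
        # Reject values outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name}={value} is outside [0, 1]")

    def unweighted_mean(self) -> float:
        # Simple average of the six metrics; this aggregation is an assumption.
        return mean(getattr(self, f.name) for f in fields(self))


# Assumed token budgets: the source gives only the 4K-32K range.
CONTEXT_WINDOWS = (4_000, 8_000, 16_000, 32_000)

# Placeholder values for illustration only, not results from the paper.
example = FabBenchScores(0.5, 0.5, 0.5, 0.5, 0.5, 0.5)
per_window = {window: example for window in CONTEXT_WINDOWS}
print(per_window[4_000].unweighted_mean())
```

Keeping one record per context window makes it easy to compare how each metric moves as the window grows, which is the kind of comparison the framework describes.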
From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query-answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors: logarithmic growth, early saturation, and cold-start dynamics, and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths.
Cross-framework validation on three additional production RAG systems confirms evaluation portability.
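The three context-scaling behaviors named in the abstract can be told apart, very roughly, by looking at score gains between successive context sizes. The sketch below is an illustrative heuristic only, not the paper's method: the function name `classify_scaling`, the tolerance `flat_tol`, and the decision rules are assumptions. It also assumes scores are sampled on a doubling grid (for example 4K, 8K, 16K, 32K), where a curve that is linear in the logarithm of context length shows a steady positive gain at every step.

```python
def classify_scaling(scores, flat_tol=0.02):
    """Label a score curve sampled at doubling context sizes.

    scores: scores ordered by increasing context size (at least 3 points).
    flat_tol: assumed threshold below which a change counts as flat.
    """
    if len(scores) < 3:
        raise ValueError("need at least three points")
    # Gain between each pair of neighbouring context sizes.
    gains = [later - earlier for earlier, later in zip(scores, scores[1:])]
    first, rest = gains[0], gains[1:]

    # Cold start: little change at first, improvement only with more context.
    if abs(first) <= flat_tol and max(rest) > flat_tol:
        return "cold-start"
    # Early saturation: an initial gain, then a plateau.
    if first > flat_tol and all(abs(g) <= flat_tol for g in rest):
        return "early saturation"
    # Log-like growth: every doubling still adds a gain above the tolerance.
    if all(g > flat_tol for g in gains):
        return "logarithmic growth"
    # Anything else, including a decline at long context lengths.
    return "unclassified"


# Placeholder curves for illustration only, not data from the paper.
print(classify_scaling([0.50, 0.60, 0.70, 0.80]))
print(classify_scaling([0.50, 0.70, 0.71, 0.71]))
print(classify_scaling([0.30, 0.31, 0.50, 0.65]))
```

A curve that falls at the largest context size lands in "unclassified" here; the abstract attributes that kind of degradation to attention dilution, which this heuristic does not attempt to diagnose.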