How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation
Quick Answer
This paper presents HieraRAG, a framework for constructing retrieval-augmented generation (RAG) benchmarks that defines the optimal granularity as the level that maximizes discriminative power.
Quick Take
The HieraRAG framework guides the construction of retrieval-augmented generation (RAG) benchmarks by choosing the granularity level that maximizes discriminative power. In a case study with a single BM25+Falcon-3-10B pipeline, 5,872 synthetic QA pairs were generated; question complexity benefits from fine granularity, while answer type and linguistic variation peak at medium granularity. The framework gives practitioners a procedure for choosing evaluation granularity in their own RAG settings.
Key Points
- HieraRAG defines optimal granularity for RAG benchmarks based on discriminative power.
- 5,872 synthetic QA pairs were generated across three dimensions and three granularity levels.
- Question complexity benefits from fine-grained distinctions, achieving a discriminative power of 0.053.
- A new Coherence Ratio metric reveals structural differences in question characteristics.
- Human evaluation confirms the quality of the synthetic QA pairs generated.
Article Content
From source RSS / original summary
arXiv:2606.12789v1 Announce Type: new
Abstract: Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity.
We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories).
With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality.
While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
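The abstract defines discriminative power as the standard deviation of generation quality across categories, and optimal granularity as the level that maximizes it. A minimal Python sketch of that selection step follows. The function names, the use of per-category mean scores, the choice of population standard deviation, and the toy scores are all assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def discriminative_power(scores_by_category):
    """Standard deviation of per-category mean quality.

    Assumption: each category is summarized by its mean score and the
    population standard deviation is used; the paper may differ.
    """
    category_means = [mean(scores) for scores in scores_by_category.values()]
    return pstdev(category_means)


def best_granularity(scores_by_level):
    """Return the granularity level with the highest discriminative power."""
    powers = {
        level: discriminative_power(categories)
        for level, categories in scores_by_level.items()
    }
    return max(powers, key=powers.get), powers


# Hypothetical quality scores for one dimension at two granularity levels.
# These numbers are invented for the example, not results from the paper.
toy_scores = {
    2: {"simple": [0.6, 0.8], "complex": [0.5, 0.7]},
    4: {
        "cat_a": [0.9, 0.7],
        "cat_b": [0.6, 0.6],
        "cat_c": [0.5, 0.3],
        "cat_d": [0.7, 0.5],
    },
}

best_level, powers = best_granularity(toy_scores)
print(best_level, powers)
```

In this toy data the four-category split spreads the category means further apart than the two-category split, so it is selected. With real data the same comparison would be repeated per dimension and per RAG configuration.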