MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning
Quick Take
MemoryDocDataSet is a benchmark for evaluating whether AI systems can navigate multi-session conversation history and comprehend long documents at the same time, featuring 1,000 QA pairs across 50 micro-worlds. The best baseline, RAG-Both, achieves an overall F1 score of only 0.358, and document-only retrieval drops to 0.267 on Hybrid questions, pointing to a clear gap in joint retrieval. The dataset, generation pipeline, and baseline implementations are publicly released.
Key Points
- MemoryDocDataSet includes 50 micro-worlds and 1,000 QA pairs.
- Hybrid questions, requiring both conversation history and document retrieval, make up 75.1% of the dataset.
- The best baseline, RAG-Both, achieves an overall F1 score of 0.358 and 0.342 on Hybrid questions.
- Document-only retrieval (RAG-Doc) scores only 0.267 on Hybrid questions, despite reaching 0.453 on Doc-only questions.
- The dataset and generation pipeline are publicly available for research.
Article Content
arXiv:2606.04442v1. Abstract: AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously.
We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each sourced from the Caselaw Access Project), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories.
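The composition of one micro-world, as described above, can be sketched as a small set of Python dataclasses. This is only an illustration under stated assumptions: the class names, field names, and the `check_micro_world` helper are hypothetical and are not taken from the released dataset, whose actual schema may differ. Only the counts and token ranges come from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    doc_id: str
    text: str
    token_count: int  # abstract states 20,000-50,000 tokens per document


@dataclass
class QAPair:
    question: str
    answer: str
    category: str    # one of five reasoning categories (names not listed in the abstract)
    source_tag: str  # e.g. "Hybrid" or "Doc-only"


@dataclass
class MicroWorld:
    personas: List[str]
    event_graph: List[tuple]   # hypothetical (earlier_event, later_event) edges
    documents: List[Document]
    sessions: List[List[str]]  # each session is a list of utterances
    qa_pairs: List[QAPair]


def check_micro_world(world: MicroWorld) -> List[str]:
    """Return violations of the counts stated in the abstract (empty if none)."""
    problems = []
    if not 3 <= len(world.personas) <= 5:
        problems.append("expected 3-5 personas")
    if not 3 <= len(world.documents) <= 5:
        problems.append("expected 3-5 documents")
    if any(not 20_000 <= d.token_count <= 50_000 for d in world.documents):
        problems.append("expected 20,000-50,000 tokens per document")
    if len(world.qa_pairs) != 20:
        problems.append("expected 20 QA pairs")
    return problems
```

With 50 micro-worlds of 20 QA pairs each, this layout is consistent with the 1,000 QA pairs reported for the full benchmark.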
The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds.
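For readers unfamiliar with the agreement statistic, Cohen's kappa compares observed agreement $p_o$ with chance agreement $p_e$ as $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two label sequences. It assumes, as one plausible reading of the abstract, that the two sequences are the verdicts an LLM judge gives to the same items under two prompt variants; the function name and inputs are illustrative and not the authors' pipeline.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

A per-benchmark figure like the one reported could then be obtained by computing one kappa per micro-world and taking `statistics.median` of the 50 values.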
We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation.
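The abstract does not say how F1 is computed. A common choice for QA benchmarks is token-overlap F1 between the predicted and reference answers, and the sketch below assumes that definition, with simple whitespace tokenisation. The function names and the `(source_tag, prediction, reference)` row format are hypothetical, intended only to show how per-tag scores such as "Hybrid" versus "Doc-only" could be broken out.

```python
from collections import Counter, defaultdict


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answers (assumed metric, whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_tag(rows):
    """Average token F1 per source tag; rows are (source_tag, prediction, reference)."""
    scores = defaultdict(list)
    for tag, prediction, reference in rows:
        scores[tag].append(token_f1(prediction, reference))
    return {tag: sum(vals) / len(vals) for tag, vals in scores.items()}
```

Reporting scores per source tag in this way is what makes the gap visible: an aggregate number alone would hide how far a document-only retriever falls on Hybrid questions.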
We release the dataset, generation pipeline, and all baseline implementations.