LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations
Quick Answer
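LANTERN is a lightweight memory layer that archives every conversation turn and restores details lost to context compaction without any LLM calls. Its reranked variant, LANTERN-Rerank, recovers 78.3% of lost verifiable facts, compared with 72.4% for a reimplementation of MemGPT's LLM-driven extraction pipeline.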
Quick Take
Context restored by LANTERN improves answer accuracy by 8.4 percentage points on average across four production LLMs, which suggests the recovered details are useful across model architectures. The memory layer itself adds fewer than 25ms of latency per turn.
Key Points
- LANTERN-Rerank recovers 78.3% of verifiable facts lost to context compaction, versus 72.4% for a reimplemented MemGPT pipeline.
- The memory layer adds fewer than 25ms of latency per turn and requires zero LLM calls.
- Even without the reranker, base LANTERN matches or exceeds the LLM-driven baseline (p=0.005).
- Accuracy improves by 8.4 percentage points on average across four production LLMs.
- Full evaluation framework released for reproducibility and future research.
Article Content
From source RSS / original summary
arXiv:2606.05182v1 Announce Type: new
Abstract: Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval -- requiring zero LLM calls and adding fewer than 25ms of latency per turn.
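The archive-then-restore idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: the summary does not say which retrievers LANTERN combines, so the class name `TurnArchive`, the token-overlap score, the character-trigram cosine (a stand-in for a dense embedding model), and the mixing weight `alpha` are all assumptions made for illustration. The only properties carried over from the abstract are that every turn is archived and that retrieval needs no LLM call.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def tokenize(text: str) -> list[str]:
    """Lowercase word/number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts; a cheap stand-in for an embedding (assumption)."""
    s = " ".join(tokenize(text))
    return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 0)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class TurnArchive:
    """Hypothetical archive: stores every turn verbatim, retrieves without an LLM."""
    turns: list[str] = field(default_factory=list)

    def archive(self, turn: str) -> None:
        # Proactively keep the full turn before any compaction happens.
        self.turns.append(turn)

    def retrieve(self, query: str, k: int = 2, alpha: float = 0.5) -> list[str]:
        # Hybrid score = alpha * lexical overlap + (1 - alpha) * fuzzy similarity.
        q_tokens = set(tokenize(query))
        q_grams = char_ngrams(query)
        scored = []
        for idx, turn in enumerate(self.turns):
            t_tokens = set(tokenize(turn))
            lexical = len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0
            fuzzy = cosine(q_grams, char_ngrams(turn))
            scored.append((alpha * lexical + (1 - alpha) * fuzzy, idx))
        # Highest score first; earlier turns win ties.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.turns[i] for score, i in scored[:k] if score > 0]
```

In use, each turn would be passed to `archive()` as it arrives, and after compaction the current user message would be passed to `retrieve()` so that the top-scoring archived turns can be added back to the prompt. A production system would likely replace the trigram cosine with a real embedding index.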
On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at kappa=0.81), LANTERN-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT's LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon p<0.0001, 95% CI [+3.1, +8.6] pp, d=0.43) at a fraction of the inference cost. Even without the reranker, base LANTERN matches or exceeds this LLM-driven baseline (p=0.005) using zero LLM calls.
When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon p<0.05 for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework -- paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis -- to support reproducibility and future work.
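To make the reported quantities concrete, the following Python sketch shows how per-conversation fact recall, a paired difference in percentage points, and a paired effect size could be computed. It is an illustration under assumptions, not the released evaluation framework: the function names are hypothetical, and the summary does not state which variant of Cohen's d the authors used, so the paired-differences form (mean difference divided by the standard deviation of the differences) is assumed here. The Wilcoxon signed-rank test itself is not reimplemented; SciPy provides it as `scipy.stats.wilcoxon`.

```python
import statistics


def recall(recovered: set[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth facts present in the recovered set."""
    if not ground_truth:
        return 0.0
    return len(recovered & ground_truth) / len(ground_truth)


def paired_summary(system: list[float], baseline: list[float]) -> dict[str, float]:
    """Summarize paired per-conversation recall scores for two systems.

    Returns the mean difference in percentage points and a paired
    Cohen's d (assumed variant: mean of differences / SD of differences).
    """
    if len(system) != len(baseline) or len(system) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [s - b for s, b in zip(system, baseline)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    cohens_d = mean_diff / sd_diff if sd_diff > 0 else float("nan")
    return {"mean_diff_pp": 100.0 * mean_diff, "cohens_d": cohens_d}
```

Each list entry corresponds to one conversation scored under both systems, which is what makes a paired test such as Wilcoxon signed-rank appropriate: the comparison is made within each conversation rather than between two independent samples.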