LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

arXiv cs.CL·Rahul Subramani

6/5/2026

Quick Answer

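LANTERN is a lightweight memory layer that archives every conversation turn and restores relevant details after context compaction, using zero LLM calls. Its reranked variant recovers 78.3% of the facts lost to compaction, compared with 72.4% for a reimplementation of MemGPT's LLM-driven extraction pipeline.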

Quick Take

When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average. The gain is significant for each model individually, which suggests the recovered context is useful across model architectures and not only cheap to produce.

Key Points

LANTERN-Rerank recovers 78.3% of verifiable facts lost to context compaction, versus 72.4% for a reimplemented MemGPT pipeline.
It adds fewer than 25ms of latency per turn and requires zero LLM calls.
Even without the reranker, base LANTERN matches or exceeds the LLM-driven baseline (p=0.005).
Answer accuracy improves by 8.4 percentage points on average across four production LLMs.
The full evaluation framework is released to support reproducibility and future work.

Article Content

From source RSS / original summary

arXiv:2606.05182v1. Abstract: Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval -- requiring zero LLM calls and adding fewer than 25ms of latency per turn.
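The abstract does not spell out how the archive or the hybrid retrieval is built, so the following is only a minimal sketch of the general idea: store every turn verbatim, then score archived turns against a query by blending two cheap signals without calling any LLM. The class name `TurnArchive`, the `alpha` weight, and the choice of token overlap plus character-trigram cosine similarity (a stand-in for a dense embedding) are all assumptions made for illustration, not the paper's design.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def tokens(text: str) -> list[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def trigrams(text: str) -> Counter:
    """Character trigram counts over the normalized text (hypothetical embedding stand-in)."""
    s = " ".join(tokens(text))
    return Counter(s[i:i + 3] for i in range(max(len(s) - 2, 0)))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class TurnArchive:
    """Hypothetical archive: keeps every turn so details survive compaction."""
    turns: list[str] = field(default_factory=list)

    def archive(self, turn: str) -> None:
        self.turns.append(turn)

    def restore(self, query: str, k: int = 2, alpha: float = 0.5) -> list[str]:
        """Return up to k archived turns ranked by a blended lexical/fuzzy score."""
        q_tokens = set(tokens(query))
        q_grams = trigrams(query)
        scored = []
        for idx, turn in enumerate(self.turns):
            overlap = len(q_tokens & set(tokens(turn)))
            lexical = overlap / len(q_tokens) if q_tokens else 0.0
            fuzzy = cosine(q_grams, trigrams(turn))
            scored.append((alpha * lexical + (1 - alpha) * fuzzy, idx))
        # Highest score first; earlier turns win ties.
        top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
        return [self.turns[idx] for score, idx in top if score > 0]


# Toy conversation, invented for this example.
archive = TurnArchive()
for turn in [
    "My flight to Lisbon leaves on March 14 at 09:40.",
    "I prefer vegetarian meals.",
    "The hotel booking reference is QX7-2291.",
]:
    archive.archive(turn)

restored = archive.restore("What is the hotel booking reference?", k=1)
print(restored)
```

In a real system the trigram scorer would likely be replaced by an embedding index, and the restored turns would be appended to the compacted context before the next model call.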

On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at kappa=0.81), LANTERN-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT's LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon p<0.0001, 95% CI [+3.1, +8.6] pp, d=0.43) at a fraction of the inference cost. Even without the reranker, base LANTERN matches or exceeds this LLM-driven baseline (p=0.005) using zero LLM calls.
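The comparison above is paired: both systems are scored on the same conversations. As a rough illustration of how such a comparison can be computed, the sketch below runs a Wilcoxon signed-rank test (normal approximation, zero differences dropped, no continuity or tie correction) and a paired Cohen's d on per-conversation recovery rates. The numbers are invented toy data, and the abstract does not say which test variant or effect-size definition the authors used, so treat both choices as assumptions.

```python
import math
import statistics


def paired_differences(a, b):
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    return [x - y for x, y in zip(a, b)]


def wilcoxon_signed_rank(diffs):
    """Return (W, two-sided p). Uses the normal approximation, which is
    crude for very small samples such as this toy example."""
    ordered = sorted((d for d in diffs if d != 0), key=abs)
    n = len(ordered)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank shared by ties
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    return w, math.erfc(abs(z) / math.sqrt(2))


def paired_cohens_d(diffs):
    """Mean difference divided by the sample standard deviation of the differences."""
    return statistics.mean(diffs) / statistics.stdev(diffs)


# Toy per-conversation recovery rates in percent (NOT the paper's data).
system_a = [80, 75, 90, 60, 85, 70, 95, 65]
system_b = [72, 75, 84, 63, 75, 66, 83, 60]

diffs = paired_differences(system_a, system_b)
w, p = wilcoxon_signed_rank(diffs)
mean_diff = statistics.mean(diffs)
d = paired_cohens_d(diffs)
print(f"mean difference = {mean_diff:.2f} pp, W = {w}, p = {p:.3f}, d = {d:.2f}")
```

Integer percentages are used so that tied absolute differences compare exactly; with floating-point rates, ties would need a tolerance.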

When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon p<0.05 for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework -- paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis -- to support reproducibility and future work.
