Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
Quick Answer
Engram is an open-source bi-temporal memory engine for LLM agents that answers from a lean retrieved context instead of the full conversation history. On LongMemEval_S it scores 83.6% using about 9.6k tokens, compared to 73.2% for the full-context baseline at 79k tokens, while keeping provenance for every fact and using roughly 8x fewer tokens.
Key Points
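- The fast write path appends lossless episodes with no LLM on the critical path.
- The lean configuration scores 83.6% on LongMemEval_S from a ~9.6k-token retrieved context, vs. 73.2% for full-context at 79k tokens.
- The hybrid read path combines extracted facts with retrieved chunks: facts alone lose recall, and the chunks recover detail.
- The project ships a neutral, in-repo evaluation harness with the official judge and the full-context baseline in every table.
- The authors document measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that distort memory benchmarks.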
Article Content
From the source RSS summary (arXiv:2606.09900v1). Abstract: Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources.
We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain.
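To make the "invalidate, never delete" idea concrete, here is a minimal Python sketch of a bi-temporal fact store. It is not Engram's actual schema or API: the `Fact` fields, the `FactStore` class, and the rule that a new object for the same (subject, predicate) supersedes the old one are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    # Hypothetical schema, not taken from the paper.
    subject: str
    predicate: str
    obj: str
    valid_from: datetime                     # valid time: when it became true
    recorded_at: datetime                    # transaction time: when it was stored
    source_episode: str                      # provenance: originating episode id
    valid_to: Optional[datetime] = None      # closed when superseded
    superseded_at: Optional[datetime] = None # when the closure was recorded
    superseded_by: Optional[int] = None      # index of the replacing fact


class FactStore:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, fact: Fact) -> int:
        """Append a fact and close any live fact it contradicts (no deletion)."""
        new_id = len(self.facts)
        for old in self.facts:
            same_key = (old.subject, old.predicate) == (fact.subject, fact.predicate)
            if same_key and old.valid_to is None and old.obj != fact.obj:
                old.valid_to = fact.valid_from
                old.superseded_at = fact.recorded_at
                old.superseded_by = new_id
        self.facts.append(fact)
        return new_id

    def as_of(self, subject: str, predicate: str,
              valid_at: datetime, known_at: datetime) -> list[str]:
        """What was true at valid_at, according to what was known at known_at."""
        out = []
        for f in self.facts:
            if (f.subject, f.predicate) != (subject, predicate):
                continue
            if f.recorded_at > known_at:
                continue  # not yet recorded at that point
            # Ignore a supersession that had not been recorded yet.
            closed = f.superseded_at is not None and f.superseded_at <= known_at
            valid_to = f.valid_to if closed else None
            if f.valid_from <= valid_at and (valid_to is None or valid_at < valid_to):
                out.append(f.obj)
        return out


store = FactStore()
jan, jun = datetime(2024, 1, 1), datetime(2024, 6, 1)
store.assert_fact(Fact("alice", "lives_in", "Paris", jan, jan, "ep1"))
store.assert_fact(Fact("alice", "lives_in", "Berlin", jun, jun, "ep7"))
```

Both facts remain in `store.facts` after the second write; the first is only closed and linked to its successor, which is what lets a point-in-time query return the older value.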
A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored.
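The abstract does not say how the signals are fused. As one plausible illustration, the sketch below uses reciprocal rank fusion (a standard technique swapped in here, not confirmed as Engram's method) over per-signal rankings, after dropping items recorded later than the as-of time. The function names, the item ids, and the constant `k = 60` are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def reciprocal_rank_fusion(rankings: dict[str, list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) over every signal in which an item appears."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def retrieve(rankings: dict[str, list[str]],
             recorded_at: dict[str, datetime],
             as_of: datetime,
             top_n: int = 3) -> list[str]:
    """Apply the as-of filter per signal, then fuse the surviving rankings."""
    visible = {
        signal: [i for i in ids if recorded_at[i] <= as_of]
        for signal, ids in rankings.items()
    }
    return [item for item, _ in reciprocal_rank_fusion(visible)[:top_n]]


# Hypothetical per-signal rankings, best first.
rankings = {
    "dense": ["c1", "c2", "c3"],
    "lexical": ["c2", "c1", "c4"],
    "graph": ["c2", "c4"],
    "recency": ["c4", "c3", "c2", "c1"],
}
recorded_at = {
    "c1": datetime(2024, 1, 1),
    "c2": datetime(2024, 2, 1),
    "c3": datetime(2024, 3, 1),
    "c4": datetime(2024, 9, 1),
}

june_view = retrieve(rankings, recorded_at, as_of=datetime(2024, 6, 1))
december_view = retrieve(rankings, recorded_at, as_of=datetime(2024, 12, 1))
```

In this toy data `c4` was recorded in September, so it is absent from `june_view` and present in `december_view`. Filtering before fusion means ranks are recomputed among the visible items only.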
The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
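The "facts plus retrieved chunks" idea can be sketched as a small context assembler that emits provenance-tagged lines under a token budget. This is an illustrative sketch only: the tag format, the `assemble_context` function, and the whitespace-based token count are assumptions, and a real system would use the model's tokenizer.

```python
def count_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words, not a real tokenizer.
    return len(text.split())


def assemble_context(facts: list[tuple[str, str]],
                     chunks: list[tuple[str, str]],
                     budget_tokens: int) -> tuple[str, int]:
    """Build a context from (text, source_id) pairs, each list sorted best first.

    Facts are placed before chunks; items that would exceed the budget
    are skipped, so a shorter later item may still fit.
    """
    lines: list[str] = []
    used = 0
    for kind, items in (("fact", facts), ("chunk", chunks)):
        for text, source_id in items:
            line = f"[{kind} src={source_id}] {text}"
            cost = count_tokens(line)
            if used + cost > budget_tokens:
                continue
            lines.append(line)
            used += cost
    return "\n".join(lines), used


# Hypothetical inputs for illustration.
facts = [("alice lives_in Berlin", "ep7")]
chunks = [
    ("Alice said she moved to Berlin in June for a new job.", "ep7"),
    ("Unrelated small talk about the weather that week.", "ep2"),
]
context, used = assemble_context(facts, chunks, budget_tokens=20)
```

With a budget of 20 the fact and the first chunk fit, and the off-topic chunk is skipped. Each surviving line carries the id of the episode it came from.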