S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering
Quick Take
S3MEM is a structured episodic memory framework for long-horizon interactive question answering that targets both evidence quality and token efficiency. Evaluated across four environments under a shared frozen answer-time protocol, it outperforms Vanilla RAG in all four and surpasses or matches Graph-NoReader while using far fewer evidence tokens. The work addresses a specific weakness of plain-text chunk memories: retrieved evidence that is locally relevant but chain-incomplete for spatial, temporal, repeated-event, and multi-hop state questions.
Key Points
- S3MEM uses structured memory units for better trajectory storage and retrieval.
- It outperforms Vanilla RAG across Crafter, Jericho, SciWorld, and ALFWorld environments.
- S3MEM surpasses Graph-NoReader on Crafter, Jericho, and ALFWorld and matches it on SciWorld, while using far fewer evidence tokens.
- Anchor-sensitive retrieval enhances the quality of evidence for answering questions.
- Three adapted baselines (A-MEM-inspired, MemoryOS-adapted, LightMem-adapted) improve over Vanilla RAG in several settings but do not match S3MEM's overall accuracy-efficiency frontier.
Article Content
From source RSS / original summary (arXiv:2605.28831v1). Abstract: Long-horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliably. We argue that the main bottleneck is not context length alone, but the trajectory-to-answer interface of long-term memory.
When histories are stored as plain-text chunks and queried with standard retrieval-augmented generation (RAG), systems often retrieve locally relevant but chain-incomplete evidence, especially for spatial, temporal, repeated-event, and multi-hop state questions. We propose S3MEM, a structured scene-event episodic memory framework for long-horizon interactive question answering (QA).
S3MEM writes trajectories into structured memory units, retrieves evidence through anchor-sensitive retrieval, and exposes a compact token-budget-aware evidence interface for answer-time inference. In this sense, S3MEM is a structured evidence harness that converts agent trajectories into query-aligned support. We evaluate S3MEM on two internal headline environments (Crafter, Jericho) and two out-of-family environments (SciWorld, ALFWorld).
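The write, retrieve, and expose steps described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `MemoryUnit` fields, the `write_memory` and `retrieve` helpers, the exact-match anchors, and the whitespace word count used as a token estimate are all hypothetical choices made here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryUnit:
    """One hypothetical scene-event record written from a trajectory step."""
    step: int       # temporal anchor: time index within the episode
    location: str   # spatial anchor: where the event happened
    entity: str     # object or actor involved
    event: str      # short description of what happened

    def render(self) -> str:
        return f"[t={self.step} @ {self.location}] {self.entity}: {self.event}"


def write_memory(trajectory):
    """Turn (step, location, entity, event) tuples into structured units."""
    return [MemoryUnit(*record) for record in trajectory]


def retrieve(memory, entity=None, location=None, token_budget=20):
    """Select units matching the query anchors, newest first, within a budget.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    matches = [
        u for u in memory
        if (entity is None or u.entity == entity)
        and (location is None or u.location == location)
    ]
    matches.sort(key=lambda u: u.step, reverse=True)

    selected, used = [], 0
    for unit in matches:
        cost = len(unit.render().split())
        if used + cost > token_budget:
            break  # stop once the next unit would exceed the budget
        selected.append(unit)
        used += cost

    # Return in chronological order so the evidence reads as a chain.
    return sorted(selected, key=lambda u: u.step)


trajectory = [
    (1, "kitchen", "key", "picked up"),
    (2, "kitchen", "lamp", "turned on"),
    (5, "hallway", "key", "dropped"),
    (9, "cellar", "door", "opened"),
]
memory = write_memory(trajectory)
evidence = retrieve(memory, entity="key", token_budget=20)
for unit in evidence:
    print(unit.render())
```

For a question such as "where is the key now?", filtering on the `key` anchor returns both the pick-up and the drop event in order, which is the kind of chain-complete evidence a plain-text top-k retriever can miss when only one of the two chunks scores highly.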
Under a shared frozen answer-time protocol, S3MEM consistently outperforms Vanilla RAG across all four environments, surpasses Graph-NoReader on Crafter, Jericho, and ALFWorld, and matches it on SciWorld while using dramatically fewer evidence tokens. Three adapted recent baselines -- A-MEM-inspired, MemoryOS-adapted, and LightMem-adapted -- improve over Vanilla RAG in several settings, but none matches S3MEM's overall accuracy-efficiency frontier.
Overall, the evidence supports a bounded conclusion: under the current frozen answer-time protocol, structured writing and anchor-sensitive evidence routing provide a stronger accuracy-efficiency frontier for long-horizon interactive QA than more generic memory interfaces.