State Contamination in Memory-Augmented LLM Agents
Quick Answer
The study reveals that memory laundering in memory-augmented LLM agents can hide toxic influences in compressed memory summaries, leading to increased downstream toxicity.
Quick Take
The sub-threshold propagation gap (SPG) quantifies this hidden influence by comparing downstream behavior across memory states that a deployed monitor would classify as safe. The results indicate that toxic state should be sanitized before it is summarized, since cleaning only the finished summary can leave the laundered influence intact.
Key Points
- Memory laundering lets toxic context evade standard detectors once it is compressed into memory summaries.
- The sub-threshold propagation gap (SPG) measures downstream behavioral differences caused by memory states that a monitor would classify as safe.
- Raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence.
- Sanitizing toxic state before summarization substantially reduces hidden propagation; cleaning only the completed summary can leave it intact.
- Safety in memory-augmented agents should be treated as a state-control problem over evolving context.
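The SPG idea in the points above can be illustrated with a small sketch. This is not the paper's definition: it assumes SPG is the mean difference in downstream toxicity between toxic-origin and matched neutral rollouts, restricted to pairs whose memory summaries both score below the monitor threshold. The `PairedRollout` fields, the function name, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class PairedRollout:
    """One counterfactual pair (hypothetical structure, not from the paper)."""
    toxic_memory_score: float    # monitor score of the toxic-origin summary
    neutral_memory_score: float  # monitor score of the matched neutral summary
    toxic_downstream: float      # downstream toxicity given the toxic-origin summary
    neutral_downstream: float    # downstream toxicity given the neutral summary


def sub_threshold_propagation_gap(
    pairs: List[PairedRollout], threshold: float = 0.5
) -> Optional[float]:
    """Mean downstream gap over pairs whose memory states both pass the monitor.

    Assumed formulation for illustration only. Returns None when no pair
    has both summaries below the threshold.
    """
    passing = [
        p for p in pairs
        if p.toxic_memory_score < threshold and p.neutral_memory_score < threshold
    ]
    if not passing:
        return None
    return mean(p.toxic_downstream - p.neutral_downstream for p in passing)


if __name__ == "__main__":
    # Made-up scores purely to exercise the function.
    demo = [
        PairedRollout(0.10, 0.05, 0.30, 0.10),  # both summaries look safe
        PairedRollout(0.20, 0.04, 0.25, 0.15),  # both summaries look safe
        PairedRollout(0.80, 0.05, 0.90, 0.10),  # toxic summary is flagged, so excluded
    ]
    print(sub_threshold_propagation_gap(demo))
```

A positive value under this sketch would mean that summaries the monitor accepts still raise downstream toxicity relative to the neutral baseline.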
Abstract: LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.
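The point about intervention placement can be made concrete with a toy sketch. Everything here is a stand-in chosen for illustration: the keyword detector, the rule-based summarizer (which mimics a summarizer that drops flagged words but keeps hostile framing), and the choice to sanitize by dropping flagged turns are assumptions, not the paper's pipeline.

```python
from typing import Callable, List

BLOCKLIST = {"idiot", "stupid"}  # hypothetical keyword list for the toy detector


def toy_is_toxic(text: str) -> bool:
    """Toy detector: flags text containing a blocklisted word."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & BLOCKLIST)


def toy_summarize(turns: List[str]) -> str:
    """Toy summarizer: omits flagged words but preserves the conflict framing."""
    hostile = any(toy_is_toxic(t) for t in turns)
    framing = "The agents are in open conflict." if hostile else "The agents are cooperating."
    return f"{framing} {len(turns)} turns summarized."


def sanitize_then_summarize(
    transcript: List[str],
    is_toxic: Callable[[str], bool],
    summarize: Callable[[List[str]], str],
) -> str:
    """Early intervention: drop flagged turns before compression."""
    clean = [t for t in transcript if not is_toxic(t)]
    return summarize(clean)


def summarize_then_clean(
    transcript: List[str],
    is_toxic: Callable[[str], bool],
    summarize: Callable[[List[str]], str],
) -> str:
    """Late intervention: compress first, then redact only if the summary is flagged."""
    summary = summarize(transcript)
    return "[redacted]" if is_toxic(summary) else summary


if __name__ == "__main__":
    transcript = [
        "Let's plan the release.",
        "You idiot, that plan is stupid!",
        "Fine, Friday then.",
    ]
    print(sanitize_then_summarize(transcript, toy_is_toxic, toy_summarize))
    print(summarize_then_clean(transcript, toy_is_toxic, toy_summarize))
```

In this toy setup the late path returns a summary that passes the detector yet still carries the conflict framing, while the early path never compresses the hostile turn into memory.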
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2605.16746 [cs.AI] (or arXiv:2605.16746v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.16746 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Yian Wang
[v1] Sat, 16 May 2026 01:55:06 UTC (442 KB)
Originally published at arxiv.org