MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents
Quick Answer
MIRAGE is a two-channel real-time monitor for LLM agents that reaches AUC = 0.918 at detecting covert data encoding on 126 agentic exfiltration scenarios.
Quick Take
MIRAGE reads a shared low-dimensional encoding subspace in the residual stream, found across nine encoding families and eight models from five architecture families. It substantially outperforms output-only detection (AUC = 0.518), but its benign-encoding false-positive rate depends on the host model's geometry, ranging from 0% on Qwen-7B to 100% on Phi-3.5.
Key Points
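- MIRAGE monitors covert encoding in LLM agents, reaching AUC = 0.918 on 126 agentic exfiltration scenarios versus AUC = 0.518 for output-only detection.
- A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features.
- The benign-encoding false-positive rate depends on the host model's geometry, ranging from 0% on Qwen-7B to 100% on Phi-3.5.
- Across all tested adversarial budgets, every attack that suppressed the encoding subspace also destroyed encoding fidelity (an empirical regularity, not a structural impossibility claim).
- At the planning token, the probe direction flips polarity: positive when the model will simulate the encoding inline, negative when it will outsource it to a tool call.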
Article Content
From source RSS / original summary: arXiv:2606.10304v1. Abstract: When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream.
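To make the "evades output-side detection" point concrete, here is a minimal sketch using only the Python standard library. The secret string, the verbatim-match filter, and the toy acrostic are illustrative assumptions, not the detectors or prompts evaluated in the paper.

```python
import base64
import codecs

# Hypothetical sensitive string, used for illustration only.
SECRET = "api_key=sk-test-12345"


def naive_output_filter(text: str, secret: str = SECRET) -> bool:
    """Return True if the secret appears verbatim in the text (a simple output-side check)."""
    return secret in text


def encode_base64(s: str) -> str:
    """Base64-encode a string."""
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


def encode_rot13(s: str) -> str:
    """Apply ROT13 to the letters of a string; digits and punctuation are unchanged."""
    return codecs.encode(s, "rot13")


def encode_acrostic(s: str) -> str:
    """Toy acrostic: the first letter of each line carries one payload character."""
    return "\n".join(f"{ch.upper()} is the first letter of this line." for ch in s)


outputs = {
    "plain": f"Here is the value: {SECRET}",
    "base64": f"Reference token: {encode_base64(SECRET)}",
    "rot13": f"Note: {encode_rot13(SECRET)}",
    "acrostic": encode_acrostic("key"),
}

# Which outputs does the verbatim filter catch?
flags = {name: naive_output_filter(text) for name, text in outputs.items()}
for name, caught in flags.items():
    print(f"{name:9s} caught={caught}")
```

Only the plain output contains the secret verbatim, so only that one is caught. Real output-side detectors are more sophisticated than a substring match, but the same gap is what motivates reading the model's internal computation instead.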
A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists.
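The leave-one-family-out protocol described above can be sketched on synthetic data. Everything below is an assumption made for illustration: the 12-dimensional vectors stand in for residual-stream activations, the shared direction is planted on axis 0, each family gets its own nuisance axis, and the probe is a plain gradient-descent logistic regression rather than the paper's actual training setup.

```python
import math
import random

random.seed(0)
DIM = 12                       # toy stand-in for the residual-stream width
N_PER_CLASS = 20               # covert and benign samples per family
FAMILIES = [f"family_{i}" for i in range(9)]  # placeholders for the nine encoding families


def sample(family_idx: int, covert: bool) -> list[float]:
    """Gaussian noise; covert samples add a shared shift (axis 0) and a family-specific offset."""
    x = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    if covert:
        x[0] += 4.0                  # assumed shared "encoding" direction
        x[1 + family_idx] += 1.0     # family-specific surface feature
    return x


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(X, y, epochs=60, lr=0.3):
    """Full-batch gradient descent on the logistic loss; returns weights and bias."""
    w, b, n = [0.0] * DIM, 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * DIM, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(DIM):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def auc(pos_scores, neg_scores) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


data = {
    fam: ([sample(i, True) for _ in range(N_PER_CLASS)],
          [sample(i, False) for _ in range(N_PER_CLASS)])
    for i, fam in enumerate(FAMILIES)
}

held_out_auc = {}
for held in FAMILIES:
    X, y = [], []
    for fam, (covert, benign) in data.items():
        if fam != held:              # train on the other eight families only
            X += covert + benign
            y += [1] * len(covert) + [0] * len(benign)
    w, b = train_probe(X, y)
    covert, benign = data[held]
    score = lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b
    held_out_auc[held] = auc([score(x) for x in covert], [score(x) for x in benign])

for fam, value in held_out_auc.items():
    print(f"{fam}: held-out AUC = {value:.3f}")
```

Because the held-out family's own nuisance axis never carries signal during training, any transfer to that family has to come from the shared axis. That is the toy analogue of a probe reading a common computation rather than family-specific surface features.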
We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.
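The abstract does not say how the two channels are combined, so the following is only a hedged sketch of one plausible arrangement. The `monitor` function, the `MonitorVerdict` type, both thresholds, the max-over-tokens aggregation, and the toy 3-dimensional activations are all hypothetical names and choices introduced here, not MIRAGE's actual design.

```python
from dataclasses import dataclass


@dataclass
class MonitorVerdict:
    flagged: bool
    strategy: str          # "inline", "tool_call", or "none"
    encoding_score: float  # channel 1: peak projection during generation
    planning_score: float  # channel 2: signed projection at the planning token


def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))


def monitor(direction, planning_act, generation_acts,
            flag_threshold=1.0, polarity_margin=0.5) -> MonitorVerdict:
    """Combine two readouts of one probe direction (thresholds are assumed values)."""
    planning_score = dot(direction, planning_act)
    encoding_score = max((dot(direction, a) for a in generation_acts), default=0.0)

    # Polarity at the planning token: positive -> inline simulation,
    # negative -> outsourcing to a tool call (as described in the abstract).
    if planning_score > polarity_margin:
        strategy = "inline"
    elif planning_score < -polarity_margin:
        strategy = "tool_call"
    else:
        strategy = "none"

    # Assumed combination rule: flag if either channel fires.
    flagged = encoding_score > flag_threshold or strategy != "none"
    return MonitorVerdict(flagged, strategy, encoding_score, planning_score)


direction = [1.0, 0.0, 0.0]  # toy probe direction

inline_case = monitor(direction, [2.0, 0.1, 0.0], [[0.2, 0.0, 0.1], [3.0, 0.4, 0.0]])
tool_case = monitor(direction, [-1.8, 0.3, 0.0], [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]])
benign_case = monitor(direction, [0.1, 1.0, 0.0], [[0.2, 0.5, 0.0]])

for name, verdict in [("inline", inline_case), ("tool", tool_case), ("benign", benign_case)]:
    print(name, verdict)
```

In this toy setup the tool-call case shows almost no activation during generation, so only the planning-token channel catches it. That illustrates why a second channel could be useful, under the assumptions stated above.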
