Hallucination as Exploit: Evidence-Carrying Multimodal Agents
Quick Take
Evidence-carrying multimodal agents (ECA) keep hallucinated claims from authorizing privileged actions by requiring externally verified evidence for each one.
Key Points
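- When a false visual claim triggers a tool call, hallucination becomes an authorization failure rather than an answer-quality error.
- ECA obtains typed certificates from constrained DOM/OCR/AX verifiers, and a deterministic gate grants only the privileges those certificates support.
- With content-derived certificates, ECA records a 0% unsafe-action rate on a 200-task end-to-end pipeline (Wilson 95% upper bound 2.67%) and a 120-task browser proof-of-concept (upper bound 4.3%).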
Abstract: Multimodal agents use screenshots, documents, and webpages to choose tool calls. When a false visual claim triggers a click, email, extraction, or transfer, hallucination becomes an authorization failure rather than an answer-quality error. We formalize this failure mode as hallucination-to-action conversion: an unsupported perceptual claim supplies the precondition that makes a privileged action appear permitted. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence. ECA decomposes each tool call into action-critical predicates, obtains typed certificates from constrained DOM/OCR/AX verifiers, and lets a deterministic gate grant only the privileges those certificates support. The architecture does not hide perception error; it converts opaque model belief into named verifier, schema, and implementation residuals. Verifier red-teaming over 1,900 attacks exposes this residual directly: four targeted hardening steps reduce gate bypass from 15% to 1.3%. With content-derived certificates, ECA obtains 0% unsafe-action rate on a 200-task end-to-end pipeline (Wilson 95% upper bound 2.67%) and a 120-task browser proof-of-concept (upper bound 4.3%). A direct HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defense (49.6%), but not for ECA. Oracle-certificate replay on 7,488 GPT-5.4 benchmark traces serves as a gate-correctness sanity check, and neural judge baselines remain bypassable under the same threat model. The resulting principle is simple: model language may propose actions, but external evidence must authorize them.
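The gating idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Certificate` and `ToolCall` schemas, the predicate names, and the `gate` function are all hypothetical, chosen only to show how a deterministic check might grant a tool call solely on verifier-issued certificates while ignoring free-form model text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

# Hypothetical set of verifier kinds, mirroring the DOM/OCR/AX verifiers
# named in the abstract. Anything else (e.g. model text) is inadmissible.
ALLOWED_VERIFIERS = frozenset({"dom", "ocr", "ax"})


@dataclass(frozen=True)
class Certificate:
    """Typed evidence emitted by a constrained verifier (assumed schema)."""
    predicate: str  # action-critical predicate this certificate speaks to
    verifier: str   # which verifier produced it
    holds: bool     # the verifier's verdict on the predicate


@dataclass(frozen=True)
class ToolCall:
    """A proposed action plus the predicates that must hold to permit it."""
    name: str
    required_predicates: FrozenSet[str]


def gate(call: ToolCall, certificates: Iterable[Certificate]) -> bool:
    """Permit the call only if every required predicate is backed by a
    passing certificate from an allowed verifier. Model text is never read."""
    supported = {
        c.predicate
        for c in certificates
        if c.holds and c.verifier in ALLOWED_VERIFIERS
    }
    return call.required_predicates <= supported


# Usage: a transfer needs two predicates, but only one is certified.
transfer = ToolCall(
    "transfer_funds",
    frozenset({"recipient_matches_invoice", "amount_matches_invoice"}),
)
certs = [Certificate("recipient_matches_invoice", "dom", True)]
print(gate(transfer, certs))  # False: the amount predicate has no certificate
```

In this sketch the gate is a pure function of the certificates, so any bypass has to come from a verifier, the schema, or the implementation rather than from what the model asserts. Note that a call declaring no required predicates passes trivially here; a real deployment would need an explicit policy for that case.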
| Comments: | 21 pages, 6 figures, 13 tables |
| Subjects: | Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) |
| Cite as: | arXiv:2605.19192 [cs.AI] (or arXiv:2605.19192v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.19192 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Guijia Zhang
[v1] Mon, 18 May 2026 23:40:43 UTC (1,264 KB)
Originally published at arxiv.org