Hallucination as Exploit: Evidence-Carrying Multimodal Agents

arXiv cs.AI·Guijia Zhang, Hao Zheng, Harry Yang

Quick Take

ECA mitigates hallucination risk by requiring external evidence, not model text, to authorize privileged actions.

Key Points

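When a false visual claim triggers a tool call, hallucination becomes an authorization failure rather than an answer-quality error.
ECA validates each tool call against typed certificates from constrained DOM/OCR/AX verifiers.
ECA achieves a 0% unsafe-action rate on a 200-task end-to-end pipeline (Wilson 95% upper bound 2.67%) and a 120-task browser proof-of-concept (upper bound 4.3%).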

[Submitted on 18 May 2026]

Abstract: Multimodal agents use screenshots, documents, and webpages to choose tool calls. When a false visual claim triggers a click, email, extraction, or transfer, hallucination becomes an authorization failure rather than an answer-quality error. We formalize this failure mode as hallucination-to-action conversion: an unsupported perceptual claim supplies the precondition that makes a privileged action appear permitted. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence. ECA decomposes each tool call into action-critical predicates, obtains typed certificates from constrained DOM/OCR/AX verifiers, and lets a deterministic gate grant only the privileges those certificates support. The architecture does not hide perception error; it converts opaque model belief into named verifier, schema, and implementation residuals. Verifier red-teaming over 1,900 attacks exposes this residual directly: four targeted hardening steps reduce gate bypass from 15% to 1.3%. With content-derived certificates, ECA obtains 0% unsafe-action rate on a 200-task end-to-end pipeline (Wilson 95% upper bound 2.67%) and a 120-task browser proof-of-concept (upper bound 4.3%). A direct HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defense (49.6%), but not for ECA. Oracle-certificate replay on 7,488 GPT-5.4 benchmark traces serves as a gate-correctness sanity check, and neural judge baselines remain bypassable under the same threat model. The resulting principle is simple: model language may propose actions, but external evidence must authorize them.
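
The gating idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Certificate`, `ToolCall`, and `gate`, the predicate strings, and the rule that every required predicate needs a passing certificate are all assumptions made for illustration. The only property taken from the abstract is that a deterministic gate decides from typed verifier certificates and never from the model's free-form text.

```python
from dataclasses import dataclass
from typing import Iterable, Literal

# Hypothetical verifier kinds, named after the DOM/OCR/AX verifiers in the abstract.
VerifierKind = Literal["DOM", "OCR", "AX"]


@dataclass(frozen=True)
class Certificate:
    """A typed result from a constrained verifier (illustrative schema)."""
    predicate: str          # the action-critical predicate that was checked
    verifier: VerifierKind  # which verifier produced the result
    holds: bool             # whether the verifier found the predicate to hold


@dataclass(frozen=True)
class ToolCall:
    """A proposed tool call, decomposed into action-critical predicates."""
    action: str
    required_predicates: frozenset[str]
    model_rationale: str = ""  # free-form model text; the gate never reads it


def gate(call: ToolCall, certificates: Iterable[Certificate]) -> bool:
    """Grant the call only if every required predicate has a passing certificate.

    The decision is a pure function of the certificates, so a hallucinated
    claim in model_rationale cannot supply a missing precondition.
    """
    supported = {c.predicate for c in certificates if c.holds}
    return call.required_predicates <= supported


# Hypothetical example: the model asserts both preconditions in text,
# but only one of them is backed by a verifier certificate.
call = ToolCall(
    action="transfer_funds",
    required_predicates=frozenset({"recipient_matches_invoice", "amount_matches_invoice"}),
    model_rationale="The page clearly shows the correct recipient and amount.",
)
partial = [Certificate("recipient_matches_invoice", "DOM", True)]
complete = partial + [Certificate("amount_matches_invoice", "OCR", True)]

denied = gate(call, partial)    # one predicate has no certificate
granted = gate(call, complete)  # every predicate is certified
```

In this sketch the model's rationale is carried along only to show that it plays no role in the decision. A real gate would also need a policy for actions that declare no predicates, which this example does not address.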

Comments:	21 pages, 6 figures, 13 tables
Subjects:	Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Cite as:	arXiv:2605.19192 [cs.AI]
	(or arXiv:2605.19192v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.19192 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Guijia Zhang
[v1] Mon, 18 May 2026 23:40:43 UTC (1,264 KB)

Originally published at arxiv.org
