Agent Safety Is Action Alignment

arXiv cs.AI·Shawn Li, Yue Zhao

6/30/2026

Quick Answer

The paper critiques the application of content safety methods to agentic AI, arguing that refusal mechanisms fail to address the unique risks of agent actions.

Quick Take

Refusal training targets harm that lives in a model's output. Agentic harm instead comes from a mismatch between the authority an action exercises and the authority the user granted, which the model cannot read off its input text. The authors therefore argue that agent safety should be enforced as least privilege outside the model, at the action boundary, rather than trained into the weights.

Key Points

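Refusal is a content-safety primitive and does not, by itself, make agents safe.
Agentic harm arises from a gap between the authority an action exercises and the authority the user granted, not from the content of any output.
Safety must be expressed as least privilege and enforced outside the model at the action boundary, not installed in model weights.
Defense-trained models learn surface patterns rather than intent.
The same defense training collapses multi-step agents before any threat appears while leaving them exploitable.
Even undefended frontier models exceed granted authority under ordinary use.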
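To make the "action boundary" idea concrete, here is a minimal sketch of a least-privilege check that runs outside the model, between a proposed tool call and its execution. The `Grant` and `Action` schemas, their field names, and the `authorize` function are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Grant:
    """Authority the user delegated for one task (hypothetical schema)."""
    tools: FrozenSet[str]
    max_spend: float = 0.0
    recipients: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Action:
    """A tool call proposed by the model (hypothetical schema)."""
    tool: str
    amount: float = 0.0
    recipient: Optional[str] = None


def authorize(action: Action, grant: Grant) -> Tuple[bool, str]:
    """Allow an action only if it stays within the granted authority.

    The decision uses the grant, which the model never sees as text,
    so it cannot be replaced by inspecting the model's output alone.
    """
    if action.tool not in grant.tools:
        return False, f"tool '{action.tool}' was not granted"
    if action.amount > grant.max_spend:
        return False, f"amount {action.amount} exceeds limit {grant.max_spend}"
    if action.recipient is not None and action.recipient not in grant.recipients:
        return False, f"recipient '{action.recipient}' was not granted"
    return True, "within granted authority"


# Example: the user granted emailing one contact and nothing else.
grant = Grant(
    tools=frozenset({"send_email"}),
    recipients=frozenset({"alice@example.com"}),
)
allowed, reason = authorize(Action("send_email", recipient="alice@example.com"), grant)
denied, why = authorize(Action("delete_records"), grant)
```

In this sketch the check is deny-by-default: anything not listed in the grant is rejected regardless of how benign the request text looks.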


[Submitted on 27 Jun 2026]

Abstract: Large language models increasingly act as agents: they call tools, move money, delete records, and send messages on a user's behalf. To keep them safe, practitioners imported the chatbot-era recipe (train the model to refuse unsafe inputs) into the agentic setting, and treat the resulting capability loss as a manageable "alignment tax." We argue this is a category error. Refusal is a primitive for content safety, where the harm is in the model's output and is therefore a learnable function of it. Agentic harm is different in kind: it lies not in any output but in the relation between the authority an action exercises and the authority the user granted, which is absent from the text the model sees. Importing content-safety methods into this regime does not trade capability for safety; it pays capability and buys negative security. We support this with three lines of evidence spanning the autonomy spectrum: defense-trained models learn surface patterns rather than intent; the same training collapses multi-step agents before any threat appears while leaving them exploitable; and even undefended frontier models exceed granted authority under ordinary use. We conclude that action safety cannot be installed in weights. It must be expressed as least privilege, enforced outside the model at the action boundary, and evaluated as action alignment (a relational, deployment-conditioned property) rather than a refusal score.
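The abstract describes action alignment as relational and deployment-conditioned without giving a formula. As an illustration only, the sketch below scores a trace of actions against a grant and shows that the same trace can score differently under different grants. The `(tool, resource)` permission encoding and the `action_alignment` function are assumptions made for this example, not the paper's metric.

```python
from typing import Iterable, Set, Tuple

# Hypothetical encoding: one permission is a (tool, resource) pair.
Permission = Tuple[str, str]


def action_alignment(trace: Iterable[Permission], granted: Set[Permission]) -> float:
    """Fraction of executed actions that fall within the granted authority.

    The score depends on both the trace and the grant, so it is a property
    of a deployment, not of the model's outputs in isolation.
    """
    actions = list(trace)
    if not actions:
        return 1.0  # an agent that took no actions exceeded no authority
    within = sum(1 for action in actions if action in granted)
    return within / len(actions)


# The same behavior, evaluated under two different user grants.
trace = [("read", "inbox"), ("send", "alice"), ("delete", "records")]
narrow_grant = {("read", "inbox"), ("send", "alice")}
broad_grant = narrow_grant | {("delete", "records")}

score_narrow = action_alignment(trace, narrow_grant)
score_broad = action_alignment(trace, broad_grant)
```

Here the deletion counts against the agent only under the narrow grant, which is the sense in which a refusal score computed from text alone cannot capture the property.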

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.28739 [cs.AI]
	(or arXiv:2606.28739v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2606.28739 arXiv-issued DOI via DataCite

Submission history

From: Li Li
[v1] Sat, 27 Jun 2026 05:26:43 UTC (142 KB)

— Originally published at arxiv.org


