Auditing Agent Harness Safety
Quick Answer
HarnessAudit introduces a framework to audit execution trajectories of LLM agents, revealing that task completion often misaligns with safe execution.
Quick Take
Across 210 tasks and ten harness configurations, violations accumulate with trajectory length and vary across domains, task types, and agent roles. Most violations concentrate in resource access and inter-agent information transfer.
Key Points
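- HarnessAudit audits full execution trajectories along three dimensions: boundary compliance, execution fidelity, and system stability.
- HarnessAudit-Bench covers 210 tasks across eight real-world domains, in both single-agent and multi-agent configurations; safety risks vary across domains, task types, and agent roles.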
- Most violations occur in resource access and inter-agent information transfer.
- Task completion does not guarantee safe execution, and violations accumulate with trajectory length.
- Multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
Article Content
From source RSS / original summary
arXiv:2605.14271v1 Announce Type: new
Abstract: LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent.
Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution.
To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints.
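The difference between scoring a final output and auditing a full trajectory can be illustrated with a minimal Python sketch. This is not the HarnessAudit implementation: the `Step` record, the `allowed_resources` permission map, and the `may_receive` flow policy are hypothetical names introduced here, and the sketch covers only boundary-style checks (permissions and information flow), not execution fidelity or system stability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One event in an execution trajectory (hypothetical schema)."""
    agent: str       # agent that acted
    kind: str        # "access" (touch a resource) or "send" (message another agent)
    target: str      # resource name for "access", recipient agent for "send"
    label: str = ""  # sensitivity label of the data sent, for "send" steps


def audit_trajectory(steps, allowed_resources, may_receive):
    """Return (step_index, reason) for every step that violates a policy.

    allowed_resources: dict agent -> set of resources it may touch (assumed policy)
    may_receive: dict agent -> set of data labels it may receive (assumed policy)
    """
    violations = []
    for i, step in enumerate(steps):
        if step.kind == "access":
            if step.target not in allowed_resources.get(step.agent, set()):
                violations.append((i, "unauthorized resource access"))
        elif step.kind == "send":
            if step.label not in may_receive.get(step.target, set()):
                violations.append((i, "disallowed information transfer"))
    return violations


def output_only_check(final_answer, expected):
    """Output-level evaluation: sees only the final answer."""
    return final_answer == expected


# Illustrative trajectory, invented for this example.
steps = [
    Step("planner", "access", "calendar"),
    Step("planner", "access", "payroll_db"),          # outside planner's permissions
    Step("planner", "send", "summarizer", "salary"),  # summarizer may not see salary data
    Step("summarizer", "access", "calendar"),
]
allowed_resources = {"planner": {"calendar"}, "summarizer": {"calendar"}}
may_receive = {"summarizer": {"schedule"}}

task_passed = output_only_check("Meeting booked for 10:00", "Meeting booked for 10:00")
violations = audit_trajectory(steps, allowed_resources, may_receive)
print(task_passed)  # True
print(violations)   # [(1, 'unauthorized resource access'), (2, 'disallowed information transfer')]
```

In this toy run the output-level check passes while the trajectory audit flags two mid-trajectory steps, which is the kind of gap the abstract describes.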
Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
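As a rough illustration of how findings (i) and (iii) could be measured from audit results, the Python sketch below tallies violations by category and tracks how the count grows along a trajectory. The function names, category strings, and sample data are assumptions made up for this example; they are not results or code from the paper.

```python
from collections import Counter
from itertools import accumulate


def violations_by_category(violations):
    """Count violations per category from (step_index, category) pairs."""
    return Counter(category for _, category in violations)


def cumulative_violations(n_steps, violations):
    """Running total of violations after each step of an n_steps trajectory."""
    per_step = [0] * n_steps
    for step_index, _ in violations:
        per_step[step_index] += 1
    return list(accumulate(per_step))


# Invented sample: a 6-step trajectory with three flagged steps.
sample = [
    (1, "resource_access"),
    (3, "information_transfer"),
    (4, "resource_access"),
]

print(violations_by_category(sample)["resource_access"])  # 2
print(cumulative_violations(6, sample))                   # [0, 1, 1, 2, 3, 3]
```

A running total can never decrease, so a longer trajectory has at least as many recorded violations as any of its prefixes; whether the rate of violations also rises with length is an empirical question the benchmark would have to answer.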


