Safeguarding LLM Agents from Misalignment through Provenance Analysis
Quick Answer
This paper shows that ProvenanceGuard, a new framework for LLM agents, reduces misalignment error rates from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench, enhancing alignment with user intent through structured provenance analysis.
Quick Take
ProvenanceGuard, a new framework for LLM agents, reduces misalignment error rates from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench, enhancing alignment with user intent through structured provenance analysis.
Key Points
- ProvenanceGuard analyzes agent actions for misalignment before tool execution.
- Error rates on misaligned traces dropped significantly across multiple benchmarks.
- Intervention burden on successful task traces reduced from 30.5% to 12.8%.
- No significant increase in unnecessary interventions on aligned traces.
- Framework leverages provenance analysis for effective alignment detection.
Paper Resources
Article Content
From source RSS / original summaryarXiv:2607. 01236v1 Announce Type: new Abstract: As LLM agents gain increasing access to powerful tools, ensuring that their actions are aligned with the user's intent becomes critical. When an agent's proposed tool invocation deviates from the user's intent -- a phenomenon called misalignment -- it may lead to harmful consequences that are difficult to undo.
Existing runtime guardrails rely on an LLM-as-a-judge paradigm that lacks a systematic framework for reasoning about alignment, often producing judgments that are inconsistent or difficult to audit. Motivated by provenance analysis, we propose a provenance-based conceptual framework that formalizes misalignment detection as determining whether a proposed tool call is supported by traceable evidence in the agent's context.
Building on this framework, we propose ProvenanceGuard, a multi-stage pipeline that analyzes the agent's action for three types of misalignment before the selected tool is executed and only allows the action to take place when it is considered aligned with the user's input query. We evaluated our proposed approach on two different benchmarks, Agent-SafetyBench and WorkBench, across 10 backbone LLMs. Compared to the LLM-as-a-judge baseline, ProvenanceGuard reduces error rate on misaligned traces from 42. 9% to 1.
8% on Agent-SafetyBench and from 32. 1% to 17. 3% on WorkBench, while reducing intervention burden on task-successful traces from 30. 5% to 12. 8% and introducing no statistically significant increase in unnecessary interventions on aligned traces. These results demonstrate that structured, provenance-based reasoning provides an effective and practical foundation for safeguarding LLM agents from misalignment.
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.CL
See more →Quantifying Prior Dominance in Systems
The study introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, revealing that Small Language Models (SLMs) outperform larger models in factual extraction. The findings indicate that traditional scaling laws yield diminishing returns, with a commercial API frequently failing against adversarial evidence due to systemic confidence collapse.