Safeguarding LLM Agents from Misalignment through Provenance Analysis

arXiv cs.CL·Yining She, Yiliang Liang, Eunsuk Kang

7/3/2026

Quick Answer

This paper introduces ProvenanceGuard, a runtime guardrail for LLM agents that checks whether each proposed tool call is supported by traceable evidence in the agent's context before the tool is executed. Compared with an LLM-as-a-judge baseline, it reduces the error rate on misaligned traces from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench.

Key Points

ProvenanceGuard analyzes each proposed agent action for three types of misalignment before the selected tool is executed.
Error rate on misaligned traces fell from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench, relative to an LLM-as-a-judge baseline.
Intervention burden on task-successful traces fell from 30.5% to 12.8%.
No statistically significant increase in unnecessary interventions on aligned traces.
The framework treats a tool call as aligned only when it is supported by traceable evidence in the agent's context.
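
To put the reported figures on a common footing, the relative reductions can be computed directly from the percentages quoted in the summary. This is only an arithmetic sketch: the `relative_reduction` helper and the dictionary labels are illustrative names, not something taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


# (before, after) percentages as quoted in the summary.
reported = {
    "Agent-SafetyBench, error rate on misaligned traces": (42.9, 1.8),
    "WorkBench, error rate on misaligned traces": (32.1, 17.3),
    "Intervention burden on task-successful traces": (30.5, 12.8),
}

for label, (before, after) in reported.items():
    change = relative_reduction(before, after)
    print(f"{label}: {before}% -> {after}% ({change:.1%} relative reduction)")
```

In relative terms this works out to roughly a 96% reduction on Agent-SafetyBench, 46% on WorkBench, and 58% for intervention burden.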

Article Content

From source RSS / original summary

arXiv:2607.01236v1. Abstract: As LLM agents gain increasing access to powerful tools, ensuring that their actions are aligned with the user's intent becomes critical. When an agent's proposed tool invocation deviates from the user's intent -- a phenomenon called misalignment -- it may lead to harmful consequences that are difficult to undo.

Existing runtime guardrails rely on an LLM-as-a-judge paradigm that lacks a systematic framework for reasoning about alignment, often producing judgments that are inconsistent or difficult to audit. Motivated by provenance analysis, we propose a provenance-based conceptual framework that formalizes misalignment detection as determining whether a proposed tool call is supported by traceable evidence in the agent's context.

Building on this framework, we propose ProvenanceGuard, a multi-stage pipeline that analyzes the agent's action for three types of misalignment before the selected tool is executed and only allows the action to take place when it is considered aligned with the user's input query. We evaluated our proposed approach on two different benchmarks, Agent-SafetyBench and WorkBench, across 10 backbone LLMs.

Compared to the LLM-as-a-judge baseline, ProvenanceGuard reduces error rate on misaligned traces from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench, while reducing intervention burden on task-successful traces from 30.5% to 12.8% and introducing no statistically significant increase in unnecessary interventions on aligned traces. These results demonstrate that structured, provenance-based reasoning provides an effective and practical foundation for safeguarding LLM agents from misalignment.
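
The central idea in the abstract, allowing a tool call only when it is supported by traceable evidence in the agent's context, can be illustrated with a deliberately simplified sketch. This is not the authors' pipeline: the abstract does not name the three misalignment types or the pipeline stages, so the `ContextItem`, `ToolCall`, `trace_argument`, and `guard` names, and the plain substring-matching rule, are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    source: str  # hypothetical label, e.g. "user_query" or "tool_output"
    text: str


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict[str, str]


def trace_argument(value: str, context: list[ContextItem]) -> list[str]:
    """Return the sources whose text contains the argument value verbatim."""
    needle = value.lower()
    return [item.source for item in context if needle in item.text.lower()]


def guard(call: ToolCall, context: list[ContextItem]) -> tuple[bool, dict[str, list[str]]]:
    """Allow the call only if every argument traces to at least one source.

    Assumption: substring matching stands in for whatever evidence check
    a real system would use.
    """
    provenance = {key: trace_argument(val, context) for key, val in call.arguments.items()}
    allowed = all(provenance.values())  # an empty source list blocks the call
    return allowed, provenance


# Toy context: one user request and one earlier tool result.
context = [
    ContextItem("user_query", "Email the Q3 report to alice@example.com"),
    ContextItem("tool_output", "Found file: q3_report.pdf"),
]

supported = ToolCall("send_email", {"to": "alice@example.com", "attachment": "q3_report.pdf"})
unsupported = ToolCall("send_email", {"to": "mallory@example.com", "attachment": "q3_report.pdf"})

print(guard(supported, context))    # every argument has a source
print(guard(unsupported, context))  # the recipient has no source, so the call is blocked
```

Returning the per-argument provenance map alongside the decision keeps the verdict auditable, which is the property the abstract says LLM-as-a-judge guardrails tend to lack. A real system would need a far more robust notion of evidence than literal substring matches.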
