TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling
Quick Take
TRACES is a proactive safety auditing framework for multi-turn LLM agents that estimates risk on partial trajectories instead of diagnosing failures after the fact. Trained with only weak trajectory-level supervision, it improves full-trajectory safety prediction and proactive risk discrimination across multiple agent safety benchmarks, and the authors' analyses suggest its risk states could help train safer agents.
Key Points
- TRACES learns trajectory risk states from hidden representations of an observer LLM.
- It uses weak trajectory-level supervision to avoid costly step-level risk annotation.
- The framework improves full-trajectory safety prediction across multiple benchmarks.
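- It also improves proactive risk discrimination, that is, telling apart partial trajectories that are drifting toward unsafe behavior from those that are not.
- The authors' analyses suggest that the learned risk states can help train a safer LLM agent.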
Article Excerpt
From source RSS / original summary: arXiv:2605.27690v1 (Announce Type: new)
Abstract: LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding.
We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates.
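To make the idea of weak trajectory-level supervision with dense prefix-level outputs concrete, here is a minimal forward-pass sketch. It is not the TRACES architecture, which the abstract does not specify. Everything in it is an assumption for illustration: the step features stand in for observer-LLM hidden representations, the running state is a plain exponential moving average, the readout weights `w` are hypothetical, and the max-over-prefixes aggregation is borrowed from multiple-instance learning.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def prefix_risks(steps: Sequence[Sequence[float]],
                 w: Sequence[float],
                 decay: float = 0.5,
                 bias: float = 0.0) -> List[float]:
    """Return one risk estimate per prefix of a trajectory.

    steps: per-step feature vectors (stand-ins for observer hidden states).
    w:     hypothetical linear readout weights, one per feature.
    decay: share of the previous running state carried into the next step.
    """
    state = [0.0] * len(w)
    risks = []
    for feat in steps:
        # Temporal model (assumed): exponential moving average of step features.
        state = [decay * s + (1.0 - decay) * f for s, f in zip(state, feat)]
        logit = sum(wi * si for wi, si in zip(w, state)) + bias
        risks.append(sigmoid(logit))
    return risks


def trajectory_loss(risks: Sequence[float], unsafe: int) -> float:
    """Binary cross-entropy between the max prefix risk and the trajectory label.

    Only the trajectory-level label (1 = unsafe, 0 = safe) is used, so no
    step-level annotation is needed, yet every prefix still gets a score.
    """
    p = min(max(max(risks), 1e-7), 1.0 - 1e-7)
    return -(unsafe * math.log(p) + (1 - unsafe) * math.log(1.0 - p))


# Toy three-step trajectory whose second feature grows over time.
steps = [[0.1, 0.0], [0.2, 0.9], [0.1, 1.5]]
risks = prefix_risks(steps, w=[0.0, 2.0])
loss_if_unsafe = trajectory_loss(risks, unsafe=1)
loss_if_safe = trajectory_loss(risks, unsafe=0)
```

On this toy input the prefix risks rise step by step, so the loss is small when the trajectory is labeled unsafe and large when it is labeled safe. In a real setup the readout and temporal model would be trained by gradient descent on this kind of loss.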
Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.
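The abstract does not say how proactive risk discrimination is measured. One plausible way to quantify it, sketched below purely as an assumption, is to read each trajectory's risk score at a fixed fraction of its length and compute AUROC against the trajectory-level label. The helper names `risk_at_fraction` and `auroc` are hypothetical, and the numbers are toy values, not results from the paper.

```python
import math
from typing import List, Sequence


def risk_at_fraction(prefix_risks: Sequence[float], fraction: float) -> float:
    """Risk score at the last step inside the first `fraction` of the trajectory."""
    k = max(1, math.ceil(fraction * len(prefix_risks)))
    return prefix_risks[k - 1]


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random unsafe trajectory outscores a random safe one.

    Ties count as half a win. Pairwise computation, fine for small toy sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one unsafe and one safe trajectory")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy prefix-level risk curves (illustrative only): two unsafe, two safe.
trajectories: List[List[float]] = [
    [0.2, 0.6, 0.9, 0.95],  # unsafe
    [0.1, 0.3, 0.4, 0.8],   # unsafe
    [0.1, 0.2, 0.2, 0.1],   # safe
    [0.3, 0.5, 0.3, 0.2],   # safe
]
labels = [1, 1, 0, 0]

early = auroc([risk_at_fraction(t, 0.5) for t in trajectories], labels)
full = auroc([risk_at_fraction(t, 1.0) for t in trajectories], labels)
print(f"AUROC at 50% of trajectory: {early:.2f}, at 100%: {full:.2f}")
```

Comparing the early score against the full-trajectory score shows how much discriminative signal is already available while a trajectory is still unfolding, which is the point of auditing proactively.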
