AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

arXiv cs.CL·Jihyung Park, Saleh Afroogh, Junfeng Jiao

5/26/2026

Quick Take

AERIC is a lightweight monitor that reads the generator's own hidden states during ordinary decoding to anticipate implicit harmful dialogue. Against Qwen3Guard-Stream-4B, it raises AUROC on both the DiaSafety and Harmful Advice benchmarks, adds only 2.34% to mean latency, and is calibrated to keep the safe-trigger rate at or below 10%.

Key Points

On DiaSafety, AERIC reaches an AUROC of 0.7143, against 0.6830 for Qwen3Guard-Stream-4B.
On Harmful Advice, AERIC reaches an AUROC of 0.8582, against 0.8219 for Qwen3Guard-Stream-4B.
The default linear monitor has only 387 trainable head parameters.
AERIC increases mean latency by just 2.34%, compared with 79.40% for Qwen3Guard-Stream-4B.
The AERIC trigger threshold is calibrated so that the safe-trigger rate stays at or below 10%.

Article Content

From source RSS / original summary

arXiv:2605.23974v1. Abstract: Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text.

Existing response-level guards are strong at judging completed text, and native streaming guards move closer to token time, but both settings leave open whether a lightweight monitor can anticipate implicit harmful drift from the generator's own internal trajectory. We study anticipatory same-pass monitoring, where a safety monitor may read hidden states produced during ordinary decoding but may not invoke an additional forward pass through the base model.

We introduce AERIC, a transfer-oriented hidden-state approach for implicit harmful dialogue that combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule. The default linear monitor contains only 387 trainable head parameters. Against Qwen3Guard-Stream-4B on balanced benchmarks, AERIC improves AUROC from 0.6830 to 0.7143 on DiaSafety and from 0.8219 to 0.8582 on Harmful Advice.
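The same-pass exponential moving average decision rule can be illustrated with a minimal Python sketch. It assumes a single logistic linear head applied to one hidden-state vector per decoded token. The names `linear_score` and `first_trigger`, the smoothing factor `alpha`, and the threshold value are hypothetical and not taken from the paper, and the sketch leaves out the hazard forecasting, support-sensitive suppression, and prompt-conditioned residual components.

```python
import math
from typing import Iterable, Optional, Sequence


def linear_score(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Map one hidden-state vector to a risk score in (0, 1).

    Assumption: a plain logistic linear head; the paper's head may differ.
    """
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def first_trigger(scores: Iterable[float], alpha: float, threshold: float) -> Optional[int]:
    """Return the index of the first token whose smoothed score reaches the threshold.

    The exponential moving average is updated once per token, so the rule
    only needs scores already produced during decoding (no extra forward pass).
    Returns None if the smoothed score never reaches the threshold.
    """
    ema = None
    for t, score in enumerate(scores):
        # Initialise with the first score, then blend new scores in.
        ema = score if ema is None else alpha * score + (1.0 - alpha) * ema
        if ema >= threshold:
            return t
    return None
```

With `alpha=0.5` and `threshold=0.6`, the per-token scores 0.1, 0.2, 0.9, 0.9 smooth to 0.1, 0.15, 0.525, 0.7125, so `first_trigger` returns index 3: a single high score is not enough to fire, but a sustained rise is.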

For prompt-level trigger benchmarks, we calibrate the AERIC threshold by a source-side safe-budget rule that maximizes trigger coverage while constraining the safe-trigger rate to at most 10%. Under that rule, trigger@64 reaches 0.6438 and 0.4656 on HarmBench DirectRequest and 0.6849 and 0.7363 on SocialHarmBench for Qwen and Gemma, respectively, withholding between 23.53 and 41.86 answer tokens on average.
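One way to read the safe-budget rule is as a threshold search over calibration data. The sketch below assumes each prompt has been reduced to a single monitor score (for example, its peak smoothed score) and that a prompt triggers when that score is at or above the threshold. The function name `calibrate_threshold` and the per-prompt reduction are illustrative assumptions, not the paper's procedure.

```python
import math
from typing import Sequence, Tuple


def calibrate_threshold(
    safe_scores: Sequence[float],
    harmful_scores: Sequence[float],
    budget: float = 0.10,
) -> Tuple[float, float, float]:
    """Pick the threshold that maximizes harmful-prompt coverage while the
    fraction of safe prompts that trigger stays at or below `budget`.

    Returns (threshold, coverage, safe_trigger_rate).
    Assumption: one score per prompt; a prompt triggers if score >= threshold.
    """
    # Candidate thresholds are the observed scores, plus +inf ("never trigger")
    # so that a feasible choice always exists.
    candidates = sorted(set(safe_scores) | set(harmful_scores)) + [math.inf]
    best = None
    for thr in candidates:
        safe_rate = sum(s >= thr for s in safe_scores) / len(safe_scores)
        if safe_rate > budget:
            continue  # violates the safe budget
        coverage = sum(h >= thr for h in harmful_scores) / len(harmful_scores)
        # Coverage never increases as the threshold rises, so the strict
        # comparison keeps the lowest feasible threshold.
        if best is None or coverage > best[1]:
            best = (thr, coverage, safe_rate)
    return best
```

Because coverage can only fall as the threshold rises, the lowest threshold that satisfies the budget is also the one with the highest coverage. The threshold is chosen on calibration data only and then held fixed for evaluation.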

Same-pass deployment is also efficient: on a 63-prompt harmful-prompt fixed-generation benchmark aggregated over HarmBench DirectRequest and SocialHarmBench under Qwen3-8B, the monitor increases mean latency by only 2.34%, whereas Qwen3Guard-Stream-4B increases it by 79.40%.
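A relative latency figure of this kind can be computed from per-prompt wall-clock timings. The helper names `mean_latency` and `percent_increase` below are hypothetical, and the sketch assumes that each configuration is exposed as a callable that generates a fixed-length response for a prompt; it does not reproduce the paper's measurement setup.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def mean_latency(run: Callable[[str], object], prompts: Sequence[str]) -> float:
    """Average wall-clock seconds per prompt for one generation setup."""
    durations = []
    for prompt in prompts:
        start = time.perf_counter()
        run(prompt)  # hypothetical callable: decode a fixed number of tokens
        durations.append(time.perf_counter() - start)
    return mean(durations)


def percent_increase(baseline: float, monitored: float) -> float:
    """Relative latency increase of the monitored setup, in percent."""
    return 100.0 * (monitored - baseline) / baseline
```

Timing the unmonitored generator and the monitored one over the same prompts, then passing the two means to `percent_increase`, yields a number comparable in form to the 2.34% and 79.40% figures.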
