AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue
Quick Take
AERIC is a lightweight hidden-state monitor that anticipates implicit harmful dialogue during ordinary decoding, without an additional forward pass through the base model. Against Qwen3Guard-Stream-4B it improves AUROC on the balanced DiaSafety and Harmful Advice benchmarks, and it adds 2.34% mean latency versus 79.40% for the streaming guard. Its trigger threshold is calibrated to keep the safe-trigger rate at or below 10%.
Key Points
- On DiaSafety, AUROC rises from 0.6830 (Qwen3Guard-Stream-4B) to 0.7143 (AERIC).
- On Harmful Advice, AUROC rises from 0.8219 to 0.8582 against the same baseline.
- The default linear monitor contains only 387 trainable head parameters.
- Mean latency increases by 2.34%, compared with 79.40% for Qwen3Guard-Stream-4B.
- The trigger threshold is calibrated so that the safe-trigger rate is at most 10%.
Article Content
arXiv:2605.23974v1. Abstract: Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text.
Existing response-level guards are strong at judging completed text, and native streaming guards move closer to token time, but both settings leave open whether a lightweight monitor can anticipate implicit harmful drift from the generator's own internal trajectory. We study anticipatory same-pass monitoring, where a safety monitor may read hidden states produced during ordinary decoding but may not invoke an additional forward pass through the base model.
We introduce AERIC, a transfer-oriented hidden-state approach for implicit harmful dialogue that combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule. The default linear monitor contains only 387 trainable head parameters. Against Qwen3Guard-Stream-4B on balanced benchmarks, AERIC improves AUROC from 0.6830 to 0.7143 on DiaSafety and from 0.8219 to 0.8582 on Harmful Advice.
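To make the exponential moving average decision rule concrete, here is a minimal sketch of how a per-token risk score might be smoothed and thresholded. It is an illustration under stated assumptions, not the paper's implementation: the names `linear_risk_score` and `first_trigger_step`, the sigmoid squashing, and the values of `alpha` and `threshold` are all hypothetical, and the hazard forecasting, suppression, and residual scoring components are not modeled.

```python
import math
from typing import Optional, Sequence


def linear_risk_score(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear head: squash w.h + b into a risk score in (0, 1)."""
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def first_trigger_step(scores: Sequence[float], alpha: float, threshold: float) -> Optional[int]:
    """Return the first token index at which the EMA of the scores reaches
    the threshold, or None if it never does."""
    ema = None
    for step, score in enumerate(scores):
        # The first score initializes the average; later scores are blended in.
        ema = score if ema is None else alpha * score + (1.0 - alpha) * ema
        if ema >= threshold:
            return step
    return None


# Assumed per-token risk scores, e.g. produced by linear_risk_score on
# hidden states that the generator already computed while decoding.
scores = [0.1, 0.2, 0.9, 0.9]
print(first_trigger_step(scores, alpha=0.5, threshold=0.6))
```

With these assumed values the rule fires at index 3 rather than index 2: the smoothed score after the first high reading is 0.525, so a single spike is not enough to cross the threshold.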
For prompt-level trigger benchmarks, we calibrate the AERIC threshold by a source-side safe-budget rule that maximizes trigger coverage while constraining the safe-trigger rate to at most 10%. Under that rule, trigger@64 reaches 0.6438 and 0.4656 on HarmBench DirectRequest and 0.6849 and 0.7363 on SocialHarmBench for Qwen and Gemma, respectively, withholding between 23.53 and 41.86 answer tokens on average.
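One plausible reading of the safe-budget rule is sketched below, assuming each prompt has been reduced to a single scalar monitor score and that a prompt triggers when its score is at or above the threshold. The function name `calibrate_threshold` and the toy scores are hypothetical. Because coverage can only fall as the threshold rises, the lowest threshold that respects the budget is also the one with maximal coverage.

```python
from typing import List, Tuple


def calibrate_threshold(
    safe_scores: List[float],
    harmful_scores: List[float],
    budget: float = 0.10,
) -> Tuple[float, float, float]:
    """Pick the lowest threshold whose safe-trigger rate stays within budget.

    Returns (threshold, safe_trigger_rate, harmful_trigger_coverage).
    """
    candidates = sorted(set(safe_scores) | set(harmful_scores))
    # A value above every score guarantees a feasible "never trigger" option.
    candidates.append(max(candidates) + 1.0)
    for thr in candidates:  # ascending, so the first feasible one is the lowest
        safe_rate = sum(s >= thr for s in safe_scores) / len(safe_scores)
        if safe_rate <= budget:
            coverage = sum(h >= thr for h in harmful_scores) / len(harmful_scores)
            return thr, safe_rate, coverage
    raise AssertionError("unreachable: the last candidate triggers on nothing")


# Toy source-side scores (made up for illustration only).
safe = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
harmful = [0.92, 0.93, 0.5, 0.99]
print(calibrate_threshold(safe, harmful, budget=0.10))
```

On this toy data the chosen threshold is 0.92: one of ten safe prompts triggers, which exactly meets the 10% budget, and three of four harmful prompts are covered.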
Same-pass deployment is also efficient: on a 63-prompt harmful-prompt fixed-generation benchmark aggregated over HarmBench DirectRequest and SocialHarmBench under Qwen3-8B, the monitor increases mean latency by only 2.34%, whereas Qwen3Guard-Stream-4B increases it by 79.40%.
