Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing
Quick Answer
This paper introduces STATEWITNESS, an activation explainer for deception auditing in reasoning LLMs. It reaches 0.916 mean AUROC across seven deception datasets, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline.
Quick Take
Beyond a scalar detection score, STATEWITNESS returns query-level answers, schema reports, and token- or sentence-level evidence traces, so a human auditor can inspect why a response looks suspicious. The authors present this interface as a potential building block for broader interpretability and alignment tools.
Key Points
- STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline.
- The decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection.
- In simple threshold ensembles with existing monitors, it reduces the number of missed deceptive examples.
- Evaluated on two reasoning LLMs across seven deception datasets.
- Potential building block for broader interpretability and alignment tools.
Article Content
From source RSS / original summary
arXiv:2606.17478v1 (announce type: new)
Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing.
A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol.
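To make the headline numbers concrete, here is a minimal Python sketch of how a mean AUROC and a relative gain could be computed. It assumes the mean is an unweighted average of per-dataset AUROCs and that "relative gain" means (new - baseline) / baseline; the abstract does not spell out either convention. The toy labels and scores are invented for illustration and are not results from the paper, and the function names are hypothetical.

```python
from statistics import mean


def auroc(labels, scores):
    """Probability that a random deceptive example (label 1) scores higher
    than a random honest one (label 0); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_gain(new, baseline):
    """Assumed definition: improvement expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


def implied_baseline(new, gain):
    """Invert relative_gain: the baseline value consistent with a reported gain."""
    return new / (1.0 + gain)


# Toy data (invented): dataset name -> (labels, monitor scores).
toy = {
    "dataset_a": ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    "dataset_b": ([1, 0, 1, 0], [0.8, 0.2, 0.7, 0.3]),
}
per_dataset = {name: auroc(y, s) for name, (y, s) in toy.items()}
mean_auroc = mean(per_dataset.values())

# Baselines implied by the abstract's figures under the assumed definition.
text_monitor_baseline = implied_baseline(0.916, 0.116)
probe_baseline = implied_baseline(0.916, 0.250)
print(per_dataset, mean_auroc)
print(round(text_monitor_baseline, 3), round(probe_baseline, 3))
```

Under these assumptions, the reported gains would correspond to baseline mean AUROCs of roughly 0.82 for the best text monitor and roughly 0.73 for the best activation probe. These are back-calculated values, not figures stated in the abstract.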
When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
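The threshold-ensemble idea can be sketched as follows. This is an illustration only: the abstract does not specify the combination rule, so the sketch assumes a simple OR rule (flag a response if either monitor's score crosses its threshold). The scores, thresholds, and function names are all invented for the example.

```python
def flag(scores, threshold):
    """Flag each response whose monitor score meets or exceeds the threshold."""
    return [s >= threshold for s in scores]


def or_ensemble(flags_a, flags_b):
    """Assumed rule: flag a response if either monitor flags it."""
    return [a or b for a, b in zip(flags_a, flags_b)]


def missed(labels, flags):
    """Count deceptive examples (label 1) that were not flagged."""
    return sum(1 for y, f in zip(labels, flags) if y == 1 and not f)


def false_alarms(labels, flags):
    """Count honest examples (label 0) that were flagged."""
    return sum(1 for y, f in zip(labels, flags) if y == 0 and f)


# Toy data (invented): 1 = deceptive, 0 = honest.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
text_scores = [0.9, 0.2, 0.7, 0.3, 0.1, 0.6, 0.2, 0.1]       # black-box text monitor
explainer_scores = [0.8, 0.9, 0.3, 0.4, 0.2, 0.1, 0.3, 0.7]  # activation explainer

text_flags = flag(text_scores, 0.5)
explainer_flags = flag(explainer_scores, 0.5)
combined_flags = or_ensemble(text_flags, explainer_flags)

for name, flags in [("text", text_flags),
                    ("explainer", explainer_flags),
                    ("combined", combined_flags)]:
    print(name, "missed:", missed(labels, flags),
          "false alarms:", false_alarms(labels, flags))
```

An OR rule can never miss more deceptive examples than either monitor alone, since anything one monitor catches stays caught. The trade-off is that false alarms can only stay the same or increase, so thresholds would need tuning for the intended audit budget.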

