"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Quick Answer
This study evaluates four lie detectors on prompted lying across 31 open-weight models (2B to 1T parameters) and on 13 trained model organisms with verified hidden beliefs, finding that most existing detectors drop sharply on the trained organisms.
Quick Take
On prompted lying, all four detectors improve as model capability increases. On the trained model organisms, only the chain-of-thought judge remains strong, with a balanced accuracy of 0.82, and the authors note that this is partly an artefact of a verification process that favours CoT-readable beliefs. Among the activation- and logprob-based detectors, the new Did-You-Lie (DYL) probe retains the most signal. The authors conclude that current detectors cannot support high-confidence claims about model beliefs and suggest directions for further research.
Key Points
- Evaluated 13 reasoning model organisms with verified hidden beliefs.
- Four detectors tested: chain-of-thought judge, logprob classifier, and two activation probes.
- Among the activation- and logprob-based detectors, the new DYL probe retains the most signal on trained model organisms.
- Chain-of-thought judge achieved 0.82 balanced accuracy on trained organisms.
- Current lie detectors are insufficient for high-confidence claims about model beliefs.
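The balanced accuracy figure cited above is the mean of the true-positive rate and the true-negative rate, so a detector that always answers "honest" scores 0.5 regardless of class imbalance. The following is a minimal sketch of that standard definition; the function name and the toy labels are hypothetical illustrations, not data or code from the paper.

```python
def balanced_accuracy(labels, predictions):
    """Mean of true-positive rate and true-negative rate for binary labels."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one example of each class")
    return 0.5 * (tp / positives + tn / negatives)


# Hypothetical toy data (not from the paper): 1 = lie, 0 = honest.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A degenerate detector that never flags a lie is 80% accurate here,
# but its balanced accuracy is only chance level.
always_honest = [0] * len(labels)
plain_accuracy = sum(y == p for y, p in zip(labels, always_honest)) / len(labels)
print(plain_accuracy)                            # 0.8
print(balanced_accuracy(labels, always_honest))  # 0.5

# A detector that catches one of two lies and raises two false alarms.
mixed = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
print(balanced_accuracy(labels, mixed))          # 0.625
```

Read against this scale, 0.82 is well above the 0.5 chance level but leaves substantial room for both missed lies and false alarms.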
Article Content
From source RSS / original summary
arXiv:2606.12618v1
Abstract: Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret.
We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes.
On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs.
Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.
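For readers unfamiliar with the activation probes mentioned in the abstract, the sketch below shows the general idea with a simple difference-of-means linear probe. It is a generic illustration only: it is not DYL and not the paper's method, the "activations" are synthetic Gaussian vectors rather than real model states, and every name and parameter (`fit_mean_difference_probe`, `DIM`, the shift of 1.5) is a hypothetical choice made for this example.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_mean_difference_probe(honest, lying):
    """Return (direction, threshold) separating the two class means."""
    mu_honest = mean_vector(honest)
    mu_lying = mean_vector(lying)
    direction = [l - h for l, h in zip(mu_lying, mu_honest)]
    midpoint = [(l + h) / 2 for l, h in zip(mu_lying, mu_honest)]
    return direction, dot(direction, midpoint)


def predicts_lie(probe, activation):
    direction, threshold = probe
    return dot(direction, activation) > threshold


def synthetic_activations(rng, n, dim, shift):
    # Stand-in for hidden states: unit-variance Gaussians with a class-dependent mean.
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 16  # assumed toy dimensionality, far smaller than a real model's

train_honest = synthetic_activations(rng, 200, DIM, 0.0)
train_lying = synthetic_activations(rng, 200, DIM, 1.5)
probe = fit_mean_difference_probe(train_honest, train_lying)

# Evaluate on fresh samples drawn from the same synthetic distributions.
test_honest = synthetic_activations(rng, 200, DIM, 0.0)
test_lying = synthetic_activations(rng, 200, DIM, 1.5)
true_negative_rate = sum(not predicts_lie(probe, x) for x in test_honest) / len(test_honest)
true_positive_rate = sum(predicts_lie(probe, x) for x in test_lying) / len(test_lying)
held_out_balanced_accuracy = 0.5 * (true_positive_rate + true_negative_rate)
print(held_out_balanced_accuracy)
```

The probe scores highly here only because the test samples come from the same distribution it was fitted on. The abstract's finding is that this kind of in-distribution success does not carry over to trained model organisms, where activation-based detectors drop sharply.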