"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

arXiv cs.AI·Alan Cooney, David Africa, Geoffrey Irving

6/12/2026

·~2 min·6/12/2026·en·2

Quick Answer

This study evaluates lie detectors across 31 models with 2B to 1T parameters, revealing that existing detectors struggle with trained model organisms.

Quick Take

The chain-of-thought judge outperforms others with a balanced accuracy of 0.82, while new methods like Did-You-Lie (DYL) retain more signal. Current detectors cannot confidently assert model beliefs, indicating a need for further research.

Key Points

Evaluated 13 reasoning model organisms with verified hidden beliefs.
Four detectors tested: chain-of-thought judge, logprob classifier, and two activation probes.
DYL method shows improved performance on prompted lying tasks.
Chain-of-thought judge achieved 0.82 balanced accuracy on trained organisms.
Current lie detectors are insufficient for high-confidence claims about model beliefs.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Source Excerpt

arXiv:2606. 12618v1 Announce Type: new Abstract: Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret.

We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. …

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.AI

See more →

arXiv cs.AI·Vinil Pasupuleti, Shyalendar Reddy Allala, Siva Rama Krishna Varma Bayyavarapu, Shrey Tyagi, Srinivasateja Songa

4d ago

FeaturedOriginal

AINTMA: Agentic AI Architecture for Autonomous Test Management with Generative Intelligence, Secure Cloud Communication and Adaptive Quality Analytics

AI Summary

AINTMA, an autonomous test management architecture utilizing six specialized AI agents, achieves 88.4% test prioritization accuracy and reduces defect escape rates from 8.3% to 2.1%. The system demonstrates a 340% ROI within nine months, showcasing the potential of agentic AI in enhancing software quality management in cloud environments.

#Agent #AI Coding #Security #Enterprise AI

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.AI

AINTMA: Agentic AI Architecture for Autonomous Test Management with Generative Intelligence, Secure Cloud Communication and Adaptive Quality Analytics

RAIL Guard: Closing the Evaluation-to-Remediation Gap in Responsible AI for Agents

Automatic Ordinary Differential Equations Discovery For Biological Systems Using Powered Agentic System

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.AI

AINTMA: Agentic AI Architecture for Autonomous Test Management with Generative Intelligence, Secure Cloud Communication and Adaptive Quality Analytics

RAIL Guard: Closing the Evaluation-to-Remediation Gap in Responsible AI for LLM Agents

Automatic Ordinary Differential Equations Discovery For Biological Systems Using Large Language Model Powered Agentic System

RAIL Guard: Closing the Evaluation-to-Remediation Gap in Responsible AI for Agents

Automatic Ordinary Differential Equations Discovery For Biological Systems Using Powered Agentic System