Probing the Misaligned Thinking Process of Language Models
Quick Take
This study defines a taxonomy of 18 misalignment indicators in large language models and detects them with linear probes, which match a strong LLM judge with 0.935 AUROC on out-of-distribution benchmarks while keeping a low false positive rate on benign traffic. The probes read the model's internal activations to monitor fine-grained cognitive processes, which matters for safe deployment in high-stakes settings.
Key Points
- Developed a taxonomy of 18 indicators for misaligned behaviors in language models.
- Matched a strong LLM judge with 0.935 AUROC on out-of-distribution benchmarks across 5 misaligned behaviors, with a low false positive rate on benign traffic.
- Used linear probes to detect misalignment indicators in the model's internal activations.
- Created an automated pipeline for generating multi-turn training conversations.
- Conducted in-depth analysis of model representations related to misalignment.
Article Content
From source RSS / original summary. arXiv:2606.24251v1. Abstract: Large language models exhibit a growing range of misaligned behaviors such as strategic deception, sandbagging, and self-preservation. As they are increasingly deployed in high-stakes settings, it is critical to reliably detect such behaviors to ensure safe and responsible use.
In this work, we propose to monitor misalignment by decomposing it into fine-grained cognitive processes -- misalignment indicators -- and detecting their presence in a model's internal activations via linear probes. We develop a taxonomy of 18 indicators spanning different misaligned behaviors, paired with an automated, meta-plan-guided pipeline that generates multi-turn training conversations.
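As a rough illustration of what a linear probe is, the sketch below fits a logistic-regression classifier on activation vectors. It is not the paper's implementation: the activations are synthetic, and the function names, the dimension of 8, the class shift of 4.0, and the training settings are all assumptions chosen for the example. In a real setup the vectors would be hidden states read from a chosen layer of the model, labeled by whether an indicator is present.

```python
import math
import random


def probe_score(w, b, x):
    """Return sigmoid(w . x + b), the probe's probability for one activation."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    if z >= 0:  # branch keeps exp() from overflowing
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(activations, labels, lr=0.2, epochs=300):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(activations[0])
    n = len(activations)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            err = probe_score(w, b, x) - y  # d(log loss)/d(logit)
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_synthetic_activations(n, dim, shift, rng):
    """Hypothetical data: positives are shifted along one hidden direction."""
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # alternate labels so both classes are balanced
        xs.append([rng.gauss(0, 1) + y * shift * d for d in direction])
        ys.append(y)
    return xs, ys


rng = random.Random(0)
xs, ys = make_synthetic_activations(400, 8, 4.0, rng)
train_x, train_y = xs[:300], ys[:300]
test_x, test_y = xs[300:], ys[300:]

w, b = train_linear_probe(train_x, train_y)
correct = sum(
    (probe_score(w, b, x) >= 0.5) == bool(y) for x, y in zip(test_x, test_y)
)
accuracy = correct / len(test_x)
print(f"held-out accuracy: {accuracy:.3f}")
```

The probe is linear on purpose: a single weight vector per indicator is cheap to evaluate at inference time, and its direction can be inspected directly.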
To rigorously evaluate generalization, we construct an out-of-distribution suite combining automated behavioral elicitation, established misalignment benchmarks, and natural benign conversations. Across 5 misaligned behaviors, our probes match a strong LLM judge with 0.935 AUROC on out-of-distribution benchmarks while keeping a low false positive rate on benign traffic. We further perform in-depth analysis to understand the probes and the model's internal representations of misalignment indicators.
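For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen positive (here, misaligned) example receives a higher score than a randomly chosen negative (benign) one, with ties counted as half. The sketch below computes it by direct pairwise comparison, together with a false positive rate at a fixed threshold. The scores, labels, and the 0.5 threshold are made-up illustration values, not results from the paper, and the function names are hypothetical.

```python
def auroc(scores, labels):
    """Pairwise (Mann-Whitney) AUROC; labels are 1 for positive, 0 for negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def false_positive_rate(benign_scores, threshold):
    """Fraction of benign examples whose score is at or above the threshold."""
    flagged = sum(1 for s in benign_scores if s >= threshold)
    return flagged / len(benign_scores)


# Illustrative probe scores: three misaligned turns, three benign turns.
scores = [0.9, 0.8, 0.35, 0.4, 0.1, 0.7]
labels = [1, 1, 1, 0, 0, 0]

area = auroc(scores, labels)
benign = [s for s, y in zip(scores, labels) if y == 0]
fpr = false_positive_rate(benign, threshold=0.5)
print(f"AUROC = {area:.3f}, FPR at 0.5 = {fpr:.3f}")
```

AUROC is threshold-free, which is why a separate false positive rate on benign traffic is worth reporting: a detector can rank well overall and still flag too many ordinary conversations at the operating threshold.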