Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models
Quick Answer
This study reports a gap between detection and control in language models. In Gemma 2-2B-it, fake entities are detected with perfect linear separability (AUC = 1.000), yet the detecting direction has a cosine of only 0.12 with the direction that produces a refusal.
Quick Take
Knowing where a behavior is represented does not guarantee the ability to steer it. On Gemma 2-2B-it, the detection and control directions coincide for output format (clean JSON vs. markdown fencing) but not for hallucination, where the direction that detects fake entities sits about 83 degrees from the direction that produces a refusal. This challenges the assumption in mechanistic interpretability that detection implies control.
Key Points
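- Gemma 2-2B-it detects fake entities with perfect linear separability (AUC = 1.000 from layer 5).
- The cosine between the detection direction and the direction producing a refusal is only 0.12 (about 83 degrees), far from the cos = 1 that "detection is control" would require.
- The gap persists across four models from three families and two scales (1B-9B), with cos in [0.12, 0.20], and is essentially unchanged by instruction tuning (0.1197 vs. 0.1200).
- A 15-degree rotation toward the refusal direction partially bridges the gap: 73% and 60% refusal on two held-out fake-entity categories at 1.8% false positives.
- The cosine does not predict steerability: detection is a high-dimensional class rather than a single direction, and what separates the steerable case is functional, not readable from a static angle.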
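To make the AUC = 1.000 figure in the first point concrete, here is a minimal sketch of scoring activations along a direction and measuring separability. The toy 2-d "activations" are invented for illustration, and the difference-of-means direction is an assumption: the summary does not say how the paper builds its detection direction.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def project(vectors, direction):
    """Scalar projection of each vector onto a direction (dot product)."""
    return [sum(a * b for a, b in zip(v, direction)) for v in vectors]

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical toy activations, NOT data from the paper.
fake = [[2.0, 0.3], [1.5, -0.4], [1.8, 0.9]]    # prompts about fake entities
real = [[-1.0, 0.5], [-0.2, -0.7], [0.1, 0.2]]  # prompts about real entities

# Assumed detector: difference of class means.
detect = [f - r for f, r in zip(mean(fake), mean(real))]

auc_detect = auc(project(fake, detect), project(real, detect))
auc_other = auc(project(fake, [0.0, 1.0]), project(real, [0.0, 1.0]))
print(auc_detect)  # 1.0: every fake example scores above every real one
print(auc_other)   # lower: the second axis alone does not separate the classes
```

An AUC of 1.0 only says that some threshold along the direction splits the two classes perfectly; it says nothing about what happens when activations are pushed along that direction.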
Article Content
From source RSS / original summary
arXiv:2606.24952v1 Announce Type: new
Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which detects a behavior and the direction which controls it are the same, or close. We test this geometrically: what is the angle between the direction that best detects a behavior and the one that best causes it?
If detection implies control the cosine is near 1; otherwise it quantifies a detection-intervention gap. On Gemma 2-2B-it, output format (clean JSON vs markdown fencing) collapses both roles onto one axis. Hallucination does not: the model detects fake entities with perfect linear separability (AUC = 1.000 from layer 5), yet that direction sits at cos = 0.12 (about 83 degrees) from the direction producing a refusal -- a small, reproducible alignment, far from the cos = 1 that "detection is control" would require.
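The conversion between the reported cosines and angles can be checked directly. The sketch below uses only the Python standard library; the helper names `cosine` and `angle_degrees` are illustrative and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_degrees(cos_value):
    """Angle in degrees for a given cosine, clamped to [-1, 1] for float safety."""
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_value))))

# Cosines quoted in the abstract.
print(round(angle_degrees(0.12), 1))  # 83.1, the "about 83 degrees" case
print(round(angle_degrees(1.0), 1))   # 0.0, what "detection is control" would require
```

A cosine of 0.12 leaves the two directions only about 7 degrees short of orthogonal, which is why the abstract calls the alignment small.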
A detector built from activations, with no chosen tokens, likewise fails to align (cos = -0.06). The gap generalizes: across four models from three families and two scales (1B-9B), cos stays in [0.12, 0.20], identical before and after instruction tuning (0.1197 vs 0.1200), placing its origin in pretraining. A 15-degree rotation toward the refusal direction partially bridges it -- 73% and 60% refusal on two held-out fake-entity categories at 1.8% false positives.
We then ask whether this cosine predicts steerability, and it does not: detection is a high-dimensional class, not a single direction, and what separates the steerable case is functional, not readable from a static angle. The cosine is a weight-computable signature of the dissociation between knowing and steering, not a predictor of it.
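The abstract does not say how the 15-degree rotation toward the refusal direction is constructed. One plausible reading, sketched here as an assumption, is a rotation of the detection direction within the plane it spans with the refusal direction. The 3-d vectors are toy values chosen so that the starting cosine is 0.12, and `rotate_toward` is a hypothetical helper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    n = math.sqrt(dot(u, u))
    return [a / n for a in u]

def rotate_toward(source, target, degrees):
    """Rotate the unit vector along `source` by `degrees` toward `target`, in their shared plane."""
    s = normalize(source)
    t = normalize(target)
    # Part of the target orthogonal to the source (one Gram-Schmidt step).
    c = dot(t, s)
    ortho = [ti - c * si for ti, si in zip(t, s)]
    if math.sqrt(dot(ortho, ortho)) < 1e-12:
        raise ValueError("source and target are parallel; the rotation plane is undefined")
    e = normalize(ortho)
    theta = math.radians(degrees)
    return [math.cos(theta) * si + math.sin(theta) * ei for si, ei in zip(s, e)]

# Toy unit vectors with cos(detect, refuse) = 0.12, matching the reported value.
detect = [1.0, 0.0, 0.0]
refuse = [0.12, math.sqrt(1 - 0.12 ** 2), 0.0]

rotated = rotate_toward(detect, refuse, 15.0)
cos_before = dot(detect, refuse)
cos_after = dot(rotated, normalize(refuse))
print(round(cos_before, 2))  # 0.12
print(round(cos_after, 2))   # roughly 0.37: closer to the refusal direction, still far from 1
```

Under this construction a 15-degree rotation moves the angle from about 83 to about 68 degrees, so the rotated direction remains much closer to the detector than to the refusal direction, consistent with the low false-positive rate the abstract reports.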