Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

arXiv cs.CL·Cosimo Galeone, Anna Ettorre, Minsu Park, Giuseppe Ettorre, Daniele Ligorio

~2 min read · 6/25/2026 · en

Quick Answer

Quick Take

This study reports a gap between detecting a behavior and controlling it in language models. On Gemma 2-2B-it, fake entities are detected with perfect linear separability (AUC = 1.000), yet the detection direction has a cosine of only 0.12 (about 83 degrees) with the direction that produces a refusal. Being able to read a behavior from the activations therefore does not guarantee being able to steer it, which challenges a common assumption in mechanistic interpretability.

Key Points

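Gemma 2-2B-it detects fake entities with perfect linear separability (AUC = 1.000 from layer 5).
The detection direction and the refusal-producing direction have a cosine of only 0.12 (about 83 degrees), far from the cos = 1 that "detection is control" would require.
The gap persists across four models from three families and two scales (1B-9B), with cosines in [0.12, 0.20].
A 15-degree rotation toward the refusal direction partially bridges the gap: 73% and 60% refusal on two held-out fake-entity categories at 1.8% false positives.
The cosine does not predict steerability: detection is a high-dimensional class rather than a single direction.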

Article Content

From source RSS / original summary

arXiv:2606.24952v1. Announce Type: new. Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which detects a behavior and the direction which controls it are the same, or close. We test this geometrically: what is the angle between the direction that best detects a behavior and the one that best causes it?
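The geometric question above comes down to a cosine between two vectors in activation space. The following is a minimal sketch of that measurement, not the authors' code: the function name `cosine_and_angle` and the three-dimensional toy vectors are illustrative assumptions, and real directions would live in the model's hidden dimension.

```python
import math

def cosine_and_angle(u, v):
    """Return (cosine, angle in degrees) between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp so floating-point rounding cannot push acos outside its domain.
    cos = max(-1.0, min(1.0, cos))
    return cos, math.degrees(math.acos(cos))

# Hypothetical toy directions, not taken from the paper.
detect_dir = [1.0, 0.0, 0.0]
control_same = [2.0, 0.0, 0.0]        # same axis: detection would imply control
control_orthogonal = [0.0, 3.0, 0.0]  # unrelated axis: the largest possible gap

print(cosine_and_angle(detect_dir, control_same))
print(cosine_and_angle(detect_dir, control_orthogonal))
```

The cosine ignores vector length, so only the orientation of the two directions matters for this comparison.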

If detection implies control, the cosine is near 1; otherwise it quantifies a detection-intervention gap. On Gemma 2-2B-it, output format (clean JSON vs markdown fencing) collapses both roles onto one axis. Hallucination does not: the model detects fake entities with perfect linear separability (AUC = 1.000 from layer 5), yet that direction sits at cos = 0.12 (about 83 degrees) from the direction producing a refusal -- a small, reproducible alignment, far from the cos = 1 that "detection is control" would require.
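To make the two quantities in this paragraph concrete, here is a hedged sketch on synthetic data. It does not use Gemma activations, and how the paper builds its detection direction is not stated here, so the toy setup (8 dimensions, bounded noise, a class offset on coordinate 0) and the helper name `auc_along` are assumptions for illustration. The AUC is computed as the fraction of (fake, real) pairs that the projection ranks correctly, and the last line converts the reported cosine of 0.12 into an angle.

```python
import math
import random

def auc_along(direction, positives, negatives):
    """AUC of the scalar projection onto `direction`; ties count one half."""
    def project(x):
        return sum(a * b for a, b in zip(x, direction))
    pos = [project(x) for x in positives]
    neg = [project(x) for x in negatives]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
dim = 8  # toy dimensionality, far smaller than a real hidden size

def sample(offset):
    # Bounded noise on every coordinate; only coordinate 0 carries the class offset.
    x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    x[0] += offset
    return x

real = [sample(0.0) for _ in range(100)]  # stand-in for real-entity activations
fake = [sample(5.0) for _ in range(100)]  # stand-in for fake-entity activations

detect_dir = [1.0] + [0.0] * (dim - 1)     # axis that carries the class signal
other_dir = [0.0, 1.0] + [0.0] * (dim - 2) # an orthogonal axis with no signal

auc_detect = auc_along(detect_dir, fake, real)
auc_other = auc_along(other_dir, fake, real)
angle_deg = math.degrees(math.acos(0.12))  # angle implied by cos = 0.12

print(auc_detect, auc_other, round(angle_deg, 1))
```

In this toy setup the classes are separated by a margin along coordinate 0, so the projection there ranks every pair correctly, while an orthogonal axis carries no class information at all.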

A detector built from activations, with no chosen tokens, likewise fails to align (cos = -0.06). The gap generalizes: across four models from three families and two scales (1B-9B), cos stays in [0.12, 0.20], identical before and after instruction tuning (0.1197 vs 0.1200), placing its origin in pretraining. A 15-degree rotation toward the refusal direction partially bridges it -- 73% and 60% refusal on two held-out fake-entity categories at 1.8% false positives.
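One plausible reading of the 15-degree rotation is a rotation of the detection direction toward the refusal direction within the plane the two vectors span. The abstract does not specify the construction, so the sketch below is an assumption rather than the authors' procedure; `rotate_toward` and the toy vectors are hypothetical names and values chosen so that the starting cosine is 0.12.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    n = math.sqrt(dot(u, u))
    if n < 1e-12:
        raise ValueError("cannot normalize a zero vector")
    return [a / n for a in u]

def rotate_toward(d, r, degrees):
    """Rotate unit direction d by `degrees` toward r, inside span{d, r}."""
    d, r = normalize(d), normalize(r)
    c = dot(d, r)
    # Component of r orthogonal to d; normalize raises if d and r are parallel.
    perp = normalize([ri - c * di for ri, di in zip(r, d)])
    t = math.radians(degrees)
    return [math.cos(t) * di + math.sin(t) * pi for di, pi in zip(d, perp)]

def angle_deg(u, v):
    c = dot(normalize(u), normalize(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

# Toy 3-d stand-ins: the refusal direction sits at cos = 0.12 from detection.
detect_dir = [1.0, 0.0, 0.0]
refusal_dir = [0.12, math.sqrt(1.0 - 0.12 ** 2), 0.0]

rotated = rotate_toward(detect_dir, refusal_dir, 15.0)
before = angle_deg(detect_dir, refusal_dir)
after = angle_deg(rotated, refusal_dir)
print(round(before, 1), round(after, 1), round(dot(rotated, refusal_dir), 3))
```

Under this construction the angle to the refusal direction shrinks by exactly the rotation amount, so a 15-degree step from roughly 83 degrees still leaves the two directions far from aligned.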

We then ask whether this cosine predicts steerability, and it does not: detection is a high-dimensional class, not a single direction, and what separates the steerable case is functional, not readable from a static angle. The cosine is a weight-computable signature of the dissociation between knowing and steering, not a predictor of it.
