A Geometric Account of Activation Steering through Angle-Norm Decomposition
Quick Answer
This study shows that activation steering methods in language models differ mainly in how they couple two geometric effects: a change in a token's angular alignment with a concept direction and a change in its hidden-state norm.
Quick Take
A controlled analysis of seven language models finds that concepts are represented primarily in angular structure, which supports the motivation for spherical steering methods. Hidden-state norm still matters, however, for the stability and downstream effects of steering.
Key Points
- Activation steering methods couple angular alignment and hidden-state norm effects.
- Seven language models were analyzed in a controlled study designed to disentangle the roles of angular and radial components.
- Concepts are primarily represented in angular structure, supporting spherical methods.
- Norm remains crucial for stability and downstream effects of steering.
- Interventions should be parameterized by interpretable angular and radial components.
Article Content
From source RSS / original summary: arXiv:2606.06735v1. Abstract: Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information.
In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm.
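The coupling described above can be illustrated with a small numerical sketch. It is not the paper's code: the function names (`angle_to`, `additive_steer`, `decompose_effect`) and the toy two-dimensional vectors are assumptions chosen for illustration. It applies a plain additive intervention `h + alpha * v` and reports, separately, how much the angle to the concept direction changed and how much the norm changed.

```python
import numpy as np

def angle_to(h, v):
    """Angle in radians between hidden state h and concept direction v."""
    cos = np.dot(h, v) / (np.linalg.norm(h) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def additive_steer(h, v, alpha):
    """Standard additive steering: shift h along v by coefficient alpha."""
    return h + alpha * v

def decompose_effect(h, v, alpha):
    """Split the effect of additive steering into an angular and a radial part."""
    steered = additive_steer(h, v, alpha)
    return {
        # Negative values mean the state moved closer (in angle) to v.
        "angle_change": angle_to(steered, v) - angle_to(h, v),
        # Ratio > 1 means the hidden-state norm grew.
        "norm_ratio": float(np.linalg.norm(steered) / np.linalg.norm(h)),
    }

# Toy example (hypothetical values): h is orthogonal to the concept direction.
h = np.array([3.0, 0.0])
v = np.array([0.0, 1.0])
effect = decompose_effect(h, v, alpha=3.0)
print(effect)  # angle_change is about -0.785 (-pi/4), norm_ratio is about 1.414
```

In this toy case a single coefficient `alpha` both rotates the state by 45 degrees toward the concept direction and inflates its norm by a factor of about 1.41, so the two effects cannot be set independently.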
Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering.
Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
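One way to picture an intervention parameterized by separate angular and radial components is sketched below. This is an assumed construction for illustration, not the parameterization used in the paper: the function `steer_angle_norm` and its arguments `theta` and `norm_scale` are hypothetical names. The sketch rotates the hidden state toward the concept direction by a chosen angle within the plane the two vectors span, then rescales the norm by a chosen factor.

```python
import numpy as np

def steer_angle_norm(h, v, theta, norm_scale=1.0):
    """Rotate h toward v by theta radians, then rescale its norm.

    theta controls the angular component, norm_scale the radial one.
    theta is not clamped, so values larger than the current angle
    between h and v rotate past the concept direction.
    """
    h_norm = np.linalg.norm(h)
    h_hat = h / h_norm
    v_hat = v / np.linalg.norm(v)

    # Part of the concept direction that is orthogonal to h.
    w = v_hat - np.dot(v_hat, h_hat) * h_hat
    w_norm = np.linalg.norm(w)
    if w_norm < 1e-12:
        # h is (anti)parallel to v: no rotation plane, so only rescale.
        return norm_scale * h
    w_hat = w / w_norm

    # Rotation inside span{h, v} preserves the norm of h_hat.
    rotated = np.cos(theta) * h_hat + np.sin(theta) * w_hat
    return norm_scale * h_norm * rotated

# Toy example (hypothetical values): rotate by 45 degrees and double the norm.
h = np.array([3.0, 0.0])
v = np.array([0.0, 1.0])
steered = steer_angle_norm(h, v, theta=np.pi / 4, norm_scale=2.0)
print(steered, np.linalg.norm(steered))
```

With `norm_scale=1.0` the function changes only the angle, and with `theta=0.0` it changes only the norm, which is the separation a single additive coefficient does not provide.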