SwordBench: Evaluating Orthogonality of Steering Image Representations
Quick Take
SwordBench is a benchmark for evaluating methods that steer image representations in vision models, covering multiple backbones and concept removal tasks.
Key Points
- Evaluates orthogonality of concept activation vectors.
- Measures cross-concept robustness and collateral damage.
- Source code to be released on GitHub.
Abstract: Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
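The abstract does not spell out how orthogonalization is performed, so the following is only a minimal sketch of one common form: projecting a representation onto the subspace orthogonal to a concept activation vector. All names (`orthogonalize`, `concept_score`, `v_a`, `v_b`) and the random data are hypothetical and are not taken from SwordBench. The sketch also illustrates why orthogonality between concept vectors matters: removing concept A shifts the linear score of concept B in proportion to the cosine between the two directions.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit Euclidean norm."""
    return v / np.linalg.norm(v)


def orthogonalize(h, v):
    """Remove from each row of h its component along direction v.

    h: array of shape (batch, dim); v: array of shape (dim,).
    Assumption: steering is a plain linear projection, which may differ
    from the methods evaluated in the paper.
    """
    v_hat = unit(v)
    return h - np.outer(h @ v_hat, v_hat)


def concept_score(h, v):
    """Signed projection of each row of h onto the unit concept direction."""
    return h @ unit(v)


# Hypothetical data: 8 representations of dimension 16 and two concept vectors.
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))
v_a = rng.normal(size=16)
v_b = v_a + rng.normal(size=16)  # deliberately correlated with v_a

steered = orthogonalize(h, v_a)

# Second-order effect: how much the score for concept B moved
# when the representations were orthogonalized against concept A.
cos_ab = float(unit(v_a) @ unit(v_b))
shift_b = concept_score(steered, v_b) - concept_score(h, v_b)
```

Under these assumptions, the score along `v_a` is zero after steering, while `shift_b` equals `-concept_score(h, v_a) * cos_ab`. When the two concept vectors are exactly orthogonal, `cos_ab` is zero and the score for concept B is unaffected by the projection.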
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.16372 [cs.CV] |
| | (or arXiv:2605.16372v1 [cs.CV] for this version) |
| | https://doi.org/10.48550/arXiv.2605.16372 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Vladimir Zaigrajew
[v1] Sun, 10 May 2026 14:45:52 UTC (7,860 KB)
— Originally published at arxiv.org