SwordBench: Evaluating Orthogonality of Steering Image Representations

arXiv cs.CV·Vladimir Zaigrajew, Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek

·~2 min·5/19/2026·en·2

Quick Take

SwordBench is a benchmark for evaluating methods that steer image representations of vision models, covering multiple backbones and concept removal tasks.

Key Points

Evaluates orthogonality of concept activation vectors.
Measures cross-concept robustness and collateral damage.
Source code to be released on GitHub.

[Submitted on 10 May 2026]

Abstract: Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
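The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions: steering is modeled as projecting a representation onto the subspace orthogonal to a single concept activation vector, and collateral damage is modeled as the drop in accuracy of a linear downstream probe after steering inputs that lack the concept. The function names (`orthogonalize`, `accuracy`, `collateral_damage`) and the toy data are hypothetical and are not taken from the SwordBench code, which has not been released.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(h, v):
    """Remove from representation h its component along concept direction v.

    Assumed steering rule: h' = h - (<h, v> / <v, v>) * v
    """
    scale = dot(h, v) / dot(v, v)
    return [hi - scale * vi for hi, vi in zip(h, v)]


def accuracy(reps, labels, w, b=0.0):
    """Accuracy of a linear probe that predicts 1 when <w, h> + b > 0."""
    correct = sum(int((dot(w, h) + b > 0) == bool(y)) for h, y in zip(reps, labels))
    return correct / len(reps)


def collateral_damage(reps, labels, task_w, concept_v):
    """Assumed metric: downstream accuracy before steering minus accuracy after.

    `reps` are meant to be inputs that do not contain the concept,
    so an ideal steering method would return 0.0 here.
    """
    before = accuracy(reps, labels, task_w)
    steered = [orthogonalize(h, concept_v) for h in reps]
    return before - accuracy(steered, labels, task_w)


# Toy 2-D example: the concept lies along the first axis.
concept_v = [1.0, 0.0]
reps = [[2.0, -1.0], [-2.0, 1.0], [0.5, 1.0], [-0.5, -1.0]]

# A task probe that shares a direction with the concept is hurt by steering.
entangled = collateral_damage(reps, [1, 0, 1, 0], [1.0, 1.0], concept_v)

# A task probe orthogonal to the concept is unaffected.
orthogonal = collateral_damage(reps, [0, 1, 1, 0], [0.0, 1.0], concept_v)

print(entangled, orthogonal)
```

In this sketch, cross-concept robustness could be estimated with the same pattern: steer with the vector of one concept, then score a probe for a different concept with `accuracy` and compare against its unsteered score.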

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2605.16372 [cs.CV]
	(or arXiv:2605.16372v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.16372 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Vladimir Zaigrajew
[v1] Sun, 10 May 2026 14:45:52 UTC (7,860 KB)

— Originally published at arxiv.org
