Positional Encodings Anchor Spatial Structure in Vision Transformers: A Geometric Perspective on Robustness

~2 min read · 6/2/2026

Quick Take

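This study argues that positional encodings (PEs) make Vision Transformers (ViTs) more robust by anchoring spatial structure to token indices. Using a new metric, the Spatial Similarity Distance Correlation (SSDC), the authors show that ViTs trained without PEs still develop spatial structure, but that this structure is driven by visual content and collapses under token permutation. All three PEs studied (learned absolute, sinusoidal, and rotary) are associated with representations that stay stable under content-disrupting perturbations and with substantially better robustness to such distribution shifts.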

Key Points

ViTs without PEs develop content-driven spatial structure, but that structure collapses under token permutation.
All PEs considered (learned absolute, sinusoidal, rotary) are associated with improved robustness to content-disrupting distribution shifts.
Spatial Similarity Distance Correlation (SSDC) quantifies spatial structure in token representations.
Robustness appears to depend more on having a stable positional reference frame than on the specific encoding mechanism.
Different PEs yield distinct depth-wise trajectories of spatial structure but largely similar robustness, with secondary variation across schemes.

Article Content

From source RSS / original summary

arXiv:2606.00124v1 (announce type: new)

Abstract: Positional embeddings (PEs) in Vision Transformers (ViTs) are known to impact performance and robustness, but their role in shaping internal spatial representations is not well understood. In this work, we study how different forms of PEs influence the representational geometry of ViTs and how these changes relate to robustness under content-disrupting distribution shifts.

We introduce a metric, the Spatial Similarity Distance Correlation (SSDC), to quantify spatial structure in token representations. Using this metric, we show that ViTs trained without PEs still develop non-trivial spatial structure, but this structure is driven by visual content and collapses under token permutation. In contrast, we find that all PEs considered (learned absolute, sinusoidal, and rotary) are associated with a consistent shift toward an index-anchored spatial organization.
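The summary does not give the exact definition of SSDC, so the following is only a rough sketch of one plausible reading: correlate pairwise token similarity with pairwise distance on the patch grid. The function names, the choice of cosine similarity, the use of Pearson correlation, and the sign flip (so that a higher score means more spatial structure) are all assumptions made for illustration, not the paper's formulation.

```python
import numpy as np

def grid_distances(h, w):
    """Euclidean distance between every pair of patch positions on an h x w grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    return np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

def cosine_similarity_matrix(tokens):
    """Cosine similarity between every pair of token vectors (rows of `tokens`)."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def ssdc_sketch(tokens, h, w):
    """Hypothetical SSDC-like score (assumed definition, not the paper's).

    Returns the negated Pearson correlation between token similarity and
    grid distance over all unordered token pairs. A positive value means
    tokens that sit close together on the grid tend to be more similar.
    """
    sim = cosine_similarity_matrix(tokens)
    dist = grid_distances(h, w)
    upper = np.triu_indices(h * w, k=1)  # each unordered pair once
    return -np.corrcoef(sim[upper], dist[upper])[0, 1]

# Toy inputs (assumptions): spatially smooth tokens versus random tokens.
h = w = 6
rng = np.random.default_rng(0)
dist = grid_distances(h, w)
smooth_tokens = np.exp(-dist**2 / (2 * 1.5**2))  # nearby patches get similar rows
random_tokens = rng.standard_normal((h * w, 64))  # no spatial structure

smooth_score = ssdc_sketch(smooth_tokens, h, w)
random_score = ssdc_sketch(random_tokens, h, w)
```

In this toy setup the smooth tokens score well above the random ones, which is the kind of contrast such a metric is meant to expose. A rank correlation could be substituted for Pearson if the relationship between similarity and distance is expected to be monotone but not linear.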

Representations in these models remain stable under perturbations that disrupt content, and exhibit substantially improved robustness to such distributional shifts.
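The contrast between content-driven and index-anchored structure can be illustrated with a toy experiment. This is not the paper's setup: the "tokens" are synthetic features, the positional code is a fixed per-index vector that is simply concatenated to each token, and the score is the same assumed similarity-versus-distance correlation, not the authors' SSDC definition. All names here are hypothetical.

```python
import numpy as np

def pairwise_grid_distance(h, w):
    """Euclidean distance between every pair of positions on an h x w grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    return np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

def spatial_score(tokens, dist):
    """Negated Pearson correlation between cosine similarity and grid distance."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    upper = np.triu_indices(len(tokens), k=1)
    return -np.corrcoef(sim[upper], dist[upper])[0, 1]

h = w = 6
rng = np.random.default_rng(0)
dist = pairwise_grid_distance(h, w)

# Toy "content": spatially smooth patch features (assumption).
content = np.exp(-dist**2 / (2 * 1.5**2))
# Toy positional code: one fixed vector per grid index (assumption).
position = np.exp(-dist**2 / (2 * 1.5**2))

# Token permutation: patch content is shuffled across grid positions.
perm = rng.permutation(h * w)
shuffled_content = content[perm]

# "No PE": the token at each index carries only the content placed there.
no_pe_clean = spatial_score(content, dist)
no_pe_shuffled = spatial_score(shuffled_content, dist)

# "With PE": the positional code stays tied to the index, whatever the content.
with_pe_clean = spatial_score(np.concatenate([content, position], axis=1), dist)
with_pe_shuffled = spatial_score(
    np.concatenate([shuffled_content, position], axis=1), dist
)
```

In this sketch the score without a positional code drops sharply once the content is shuffled, while the concatenated positional code keeps part of the structure tied to the grid index. It only illustrates the distinction the abstract draws and says nothing about how trained ViTs actually behave.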

We further show that while different PEs produce distinct depth-wise trajectories of spatial structure, their robustness properties are largely similar (with secondary variation across encoding schemes), suggesting that robustness appears to depend on the presence of a stable positional reference frame more than it depends on the specific encoding mechanism.

These results offer a geometric account of how positional encodings shape internal representations, with implications for the principled design of future encoding schemes.


