ViT-Up: Faithful Feature Upsampling for Vision Transformers
Quick Answer
ViT-Up introduces an implicit feature upsampling framework for Vision Transformers, enhancing dense prediction tasks.
Quick Take
On the DINOv3-S+ backbone, ViT-Up outperforms prior image-guided upsamplers by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. The gains grow with the larger DINOv3-B backbone, indicating that the method scales favorably with backbone capacity.
Key Points
- ViT-Up replaces external image guidance with layer-wise query construction from ViT hidden states.
- Achieves state-of-the-art performance in dense prediction and semantic correspondence tasks.
- With the larger DINOv3-B backbone, gains over prior methods reach +3.36 mIoU and +8.09 PCK@0.10.
- Addresses issues of feature leakage and blur in traditional image-guided upsampling methods.
- Demonstrates consistent performance gains across various benchmarks.
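For readers unfamiliar with the PCK@0.10 metric quoted above, here is a minimal sketch of how it is commonly computed. It assumes the frequent convention of thresholding at alpha times the longer side of the object bounding box; the function name `pck` and the sample keypoints are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math


def pck(pred_pts, gt_pts, bbox_hw, alpha=0.10):
    """Fraction of predicted keypoints within alpha * max(h, w) of the ground truth.

    Assumption: the threshold is scaled by the longer side of the object
    bounding box, a common convention for semantic correspondence benchmarks.
    """
    h, w = bbox_hw
    thresh = alpha * max(h, w)
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(pred_pts, gt_pts)
        if math.hypot(px - gx, py - gy) <= thresh
    )
    return hits / len(gt_pts)


# Hypothetical keypoints: two land within 10 px of the target, one is 30 px off.
pred = [(10.0, 10.0), (50.0, 52.0), (90.0, 120.0)]
gt = [(12.0, 11.0), (50.0, 50.0), (90.0, 90.0)]
score = pck(pred, gt, bbox_hw=(100, 80))  # threshold = 0.10 * 100 = 10 px
```

A gain of +4.17 PCK@0.10 therefore means roughly four more keypoints out of every hundred fall inside that tolerance radius.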
Article Content
From source RSS / original summary
arXiv:2606.14024v1 Announce Type: new
Abstract: Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation.
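The quadratic cost is easy to see with a little arithmetic. The sketch below assumes a square image, a patch size of 16 (an illustrative value, not one stated in the abstract), and ignores any class or register tokens; `attention_cost` is a hypothetical helper name.

```python
def attention_cost(image_size, patch_size=16):
    """Return (number of patch tokens, number of token pairs in global attention)."""
    tokens_per_side = image_size // patch_size
    n_tokens = tokens_per_side ** 2
    # Global self-attention compares every token with every token.
    return n_tokens, n_tokens ** 2


costs = {size: attention_cost(size) for size in (224, 448, 896)}
for size, (n_tokens, n_pairs) in costs.items():
    print(f"{size}x{size} image: {n_tokens} tokens, {n_pairs} attention pairs")
```

Under these assumptions, doubling the image side quadruples the token count and multiplies the attention pairs by sixteen, so fine token grids quickly become expensive.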
This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states.
This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
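To make "feature prediction at arbitrary continuous image coordinates" concrete, here is a sketch of the simplest possible version: bilinear interpolation of a coarse patch-token grid. This is a naive baseline and not the ViT-Up method, which the abstract says builds queries from intermediate ViT hidden states. The function name `sample_bilinear`, the normalized coordinate convention, and the toy grid are all assumptions for illustration.

```python
def sample_bilinear(grid, x, y):
    """Sample a feature vector from an H x W x C grid at normalized coords in [0, 1].

    Assumption: token (i, j) is centered at ((j + 0.5) / W, (i + 0.5) / H).
    Coordinates outside the outermost token centers are clamped to the border.
    """
    h, w = len(grid), len(grid[0])
    # Map normalized image coordinates to continuous grid indices, then clamp.
    gx = min(max(x * w - 0.5, 0.0), w - 1.0)
    gy = min(max(y * h - 0.5, 0.0), h - 1.0)
    x0, y0 = int(gx), int(gy)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = gx - x0, gy - y0
    # Blend the four neighbouring token features channel by channel.
    return [
        (1 - ty) * ((1 - tx) * a + tx * b) + ty * ((1 - tx) * c + tx * d)
        for a, b, c, d in zip(grid[y0][x0], grid[y0][x1], grid[y1][x0], grid[y1][x1])
    ]


# Toy 2 x 2 token grid with 2 channels per token.
grid = [
    [[0.0, 1.0], [2.0, 1.0]],
    [[4.0, 1.0], [6.0, 1.0]],
]
center = sample_bilinear(grid, 0.5, 0.5)         # midway between all four tokens
top_left = sample_bilinear(grid, 0.25, 0.25)     # exactly on the first token center
```

Interpolation of this kind only blends features that already exist on the coarse grid, which is why learned upsamplers try to recover finer detail than it can provide.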