ViT-Up: Faithful Feature Upsampling for Vision Transformers

arXiv cs.CV·Krispin Wandel, Jingchuan Wang, Hesheng Wang

6/15/2026

Quick Answer

ViT-Up introduces an implicit feature upsampling framework for Vision Transformers, enhancing dense prediction tasks.

Quick Take

On DINOv3-S+, ViT-Up outperforms prior image-guided upsamplers by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. The gains grow with the larger DINOv3-B backbone, indicating that the method scales favorably with backbone capacity.

Key Points

ViT-Up replaces external image guidance with layer-wise query construction from ViT hidden states.
Achieves state-of-the-art performance in dense prediction and semantic correspondence tasks.
With the DINOv3-B backbone, gains over prior methods reach +3.36 mIoU and +8.09 PCK@0.10.
Addresses the feature leakage, fragmentation, and blur that shallow image encoders can introduce in image-guided upsampling.
Gains grow with backbone capacity, from DINOv3-S+ to DINOv3-B.
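
The two metrics quoted in these points can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, benchmark mIoU is normally accumulated over a whole dataset rather than a single label map, and the PCK reference size (assumed here to be a single number such as the larger side of the object bounding box) depends on the benchmark protocol.

```python
from __future__ import annotations

import math


def mean_iou(pred: list[int], target: list[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def pck(pred_pts, gt_pts, ref_size: float, alpha: float = 0.10) -> float:
    """Fraction of predicted keypoints within alpha * ref_size of the ground truth."""
    threshold = alpha * ref_size
    hits = sum(1 for p, g in zip(pred_pts, gt_pts) if math.dist(p, g) <= threshold)
    return hits / len(gt_pts)


if __name__ == "__main__":
    # Toy flattened label maps with two classes.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
    # Two keypoints, reference size 100 px, so the threshold is 10 px.
    print(pck([(10, 10), (50, 50)], [(12, 10), (80, 50)], ref_size=100.0))
```

Under these definitions, PCK@0.10 counts a keypoint match as correct when it lands within 10% of the reference size from the annotated location.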


Article Content

From source RSS / original summary

arXiv:2606.14024v1. Abstract: Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation.
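
To make the bottleneck concrete, the arithmetic can be sketched as follows. The patch size of 16 and the input resolutions are illustrative assumptions, not values taken from the paper; the point is only that the number of query-key pairs in global self-attention grows with the square of the token count.

```python
from __future__ import annotations


def token_grid(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Patch-token grid size, assuming non-overlapping square patches."""
    return height // patch, width // patch


def attention_pairs(height: int, width: int, patch: int = 16) -> int:
    """Number of query-key pairs in one global self-attention map."""
    rows, cols = token_grid(height, width, patch)
    tokens = rows * cols
    return tokens * tokens


if __name__ == "__main__":
    for side in (224, 448, 896):  # hypothetical input resolutions
        rows, cols = token_grid(side, side)
        pairs = attention_pairs(side, side)
        print(f"{side}x{side}: grid {rows}x{cols}, pairs {pairs:,}")
```

Doubling the image side quadruples the token count and multiplies the pair count by 16, which is why backbones tend to be run on coarse grids with an upsampler applied afterward.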

This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states.

This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
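
The idea of predicting a feature at an arbitrary continuous coordinate can be illustrated with a deliberately simple stand-in. The sketch below is not ViT-Up: the abstract does not specify how the layer-wise queries are built, so this example swaps in plain bilinear interpolation over a single patch-token grid. The function name, the normalized [0, 1] coordinate convention, and the token-center placement are all assumptions made for illustration; it only shows the interface of "continuous coordinate in, feature vector out".

```python
from __future__ import annotations


def sample_bilinear(grid: list[list[list[float]]], x: float, y: float) -> list[float]:
    """Interpolate a feature vector at normalized coordinates (x, y) in [0, 1].

    grid[r][c] is the feature vector of the patch token at row r, column c.
    Token centers are assumed to sit at ((c + 0.5) / cols, (r + 0.5) / rows).
    """
    rows, cols = len(grid), len(grid[0])
    # Map normalized coordinates to fractional token indices, clamped to the grid.
    fx = min(max(x * cols - 0.5, 0.0), cols - 1.0)
    fy = min(max(y * rows - 0.5, 0.0), rows - 1.0)
    c0, r0 = int(fx), int(fy)
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    wx, wy = fx - c0, fy - r0
    dim = len(grid[0][0])
    return [
        (1 - wy) * ((1 - wx) * grid[r0][c0][d] + wx * grid[r0][c1][d])
        + wy * ((1 - wx) * grid[r1][c0][d] + wx * grid[r1][c1][d])
        for d in range(dim)
    ]


if __name__ == "__main__":
    # A 2x2 token grid with one-dimensional features.
    coarse = [[[0.0], [1.0]], [[2.0], [3.0]]]
    print(sample_bilinear(coarse, 0.5, 0.5))    # image center
    print(sample_bilinear(coarse, 0.25, 0.25))  # center of the top-left token
```

Interpolation of this kind can only blend neighboring tokens, so it tends to blur detail at object boundaries; learned upsamplers such as the one described here aim to recover sharper features while staying in the backbone's feature space.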

