Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation
Quick Take
ViTC-UNet conditions a UNet on frozen, pre-trained Vision Transformer representations to improve semantic segmentation of biomedical images.
Key Points
- Addresses performance gaps in Vision Transformers for biomedical segmentation.
- Combines global visual priors with local inductive bias.
- Outperforms baseline models in MRI and CT segmentation tasks.
Abstract: Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.
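To make the "learnable tokens and a two-way attention decoder" idea concrete, here is a minimal NumPy sketch of one two-way attention step: tokens first attend to the ViT patch features, then the patch features attend back to the updated tokens. This is an illustrative assumption, not the authors' implementation. The names `cross_attention` and `two_way_block`, the shapes, the residual updates, and the absence of learned projections are all hypothetical, and random arrays stand in for the frozen ViT embeddings and the learnable tokens. The abstract does not say how the resulting tokens condition the UNet, so that step is left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Simplification (assumption): no learned query/key/value projections,
    so keys and values are the same array.
    """
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)  # (n_queries, n_kv)
    return softmax(scores, axis=-1) @ keys_values    # (n_queries, dim)


def two_way_block(tokens, vit_feats):
    """One hypothetical two-way attention step with residual updates."""
    # Direction 1: tokens gather information from the ViT patch features.
    tokens = tokens + cross_attention(tokens, vit_feats)
    # Direction 2: patch features are updated using the refreshed tokens.
    vit_feats = vit_feats + cross_attention(vit_feats, tokens)
    return tokens, vit_feats


rng = np.random.default_rng(0)
num_patches, num_tokens, dim = 16, 4, 8               # illustrative sizes only
vit_feats = rng.normal(size=(num_patches, dim))       # stand-in for frozen ViT patch embeddings
tokens = rng.normal(size=(num_tokens, dim))           # stand-in for learnable tokens

new_tokens, new_feats = two_way_block(tokens, vit_feats)
print(new_tokens.shape, new_feats.shape)
```

In a trainable version, only the tokens and the decoder weights would receive gradients while the ViT stays frozen, which is consistent with the abstract's claim of avoiding end-to-end ViT fine-tuning.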
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.16393 [cs.CV] |
| | (or arXiv:2605.16393v1 [cs.CV] for this version) |
| | https://doi.org/10.48550/arXiv.2605.16393 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Joel Valdivia Ortega
[v1] Tue, 12 May 2026 11:56:46 UTC (40,256 KB)
— Originally published at arxiv.org