Harnessing Self-Supervised Features for Art Classification
Quick Take
This paper explores self-supervised features for improved artwork classification and retrieval.
Key Points
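- Evaluates the DINO family and CLIP models as feature extractors for artwork classification and retrieval, with a focus on paintings.
- A self-supervised backbone yields consistent improvements in classification performance.
- Offers insights for real-world uses such as VR applications that support museum navigation.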
Abstract: Classifying artworks presents a significant challenge due to the complex interplay of fine-grained details and abstract features that condition the style or genre of an artwork. This paper presents a systematic investigation of the effectiveness of supervised and self-supervised backbones as feature extractors for both artwork classification and retrieval, with a particular focus on paintings. We conduct an extensive experimental evaluation using the DINO family and CLIP models, assessing multiple classification strategies and feature representations. Our results demonstrate that employing a self-supervised backbone leads to consistent improvements in artwork classification performance. Moreover, our work provides insights into the applicability of classification and retrieval modules in real-world applications, such as virtual reality (VR) applications that support museum navigation.
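To make the idea of a frozen backbone used as a feature extractor concrete, here is a minimal sketch of cosine-similarity retrieval and k-nearest-neighbour classification over precomputed feature vectors. It is an illustration only, not the paper's method: the abstract does not say which classification strategies were used, the function names `retrieve` and `knn_classify` are hypothetical, and synthetic vectors stand in for the embeddings a DINO or CLIP model would produce.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve(query, gallery, k=5):
    """Return, for each query row, the indices of the k most similar gallery rows."""
    sims = l2_normalize(query) @ l2_normalize(gallery).T  # (n_query, n_gallery)
    return np.argsort(-sims, axis=-1)[:, :k]


def knn_classify(query, gallery, gallery_labels, k=5):
    """Predict a label per query by majority vote among its k nearest gallery items."""
    idx = retrieve(query, gallery, k)
    votes = gallery_labels[idx]  # (n_query, k)
    n_classes = int(gallery_labels.max()) + 1
    counts = np.stack([np.bincount(v, minlength=n_classes) for v in votes])
    return counts.argmax(axis=1)


# Synthetic stand-in for backbone features (assumption): three well-separated
# "style" clusters in an 8-dimensional feature space.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(3, 8)
gallery_labels = np.repeat(np.arange(3), 10)
gallery = centers[gallery_labels] + 0.1 * rng.standard_normal((30, 8))

query_labels = np.array([0, 1, 2])
queries = centers[query_labels] + 0.1 * rng.standard_normal((3, 8))

neighbours = retrieve(queries, gallery, k=5)
predictions = knn_classify(queries, gallery, gallery_labels, k=5)
```

In a real pipeline, `gallery` and `queries` would hold embeddings extracted once from the frozen backbone, so the same feature matrix can serve both the retrieval and the classification module.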
| Comments: | IRCDL 2026 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM) |
| Cite as: | arXiv:2605.18974 [cs.CV] (or arXiv:2605.18974v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.18974 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Evelyn Turri
[v1] Mon, 18 May 2026 18:00:59 UTC (14,142 KB)
— Originally published at arxiv.org