Deep Psychovisual Image Representations
Quick Take
The paper introduces Deep Visual Coding, a psychovisual-based deep learning framework that uses learned frequency-domain representations to make vision models more interpretable and efficient. Unlike traditional CNNs, the approach depends less on network depth to extract salient features, which points toward more transparent decision-making.
Key Points
- Deep Visual Coding leverages frequency-domain representations for enhanced feature extraction.
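- Salience analyses show the models extract interpretable object parts, whereas regular CNNs produce amorphous regions.
- The models depend less on depth than CNNs when scaling, because the complex-valued representations and learned abstractions take over the role of deep spatial layers.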
- Psychovisual models encode task-relevant structures within distinct frequency sub-bands.
- Findings suggest a path toward more transparent vision models.
Article Content
From source RSS / original summary (arXiv:2605.29260v1). Abstract: Psychovisual models suggest human vision decouples low-level feature extraction from higher cognition by first forming intermediate abstractions. In contrast, deep learning-based vision models routinely extract and aggregate features using homogeneous stacks of spatial layers, rendering their decision-making processes opaque.
In this paper, we propose Deep Visual Coding, a learned frequency-domain representation inspired by 1990s image codes that quantised perceptually salient frequencies, which together with complex-valued image representations produces psychovisual-style abstractions. This approach enables the first psychovisual-based deep learning framework, utilizing data-driven spectral filters that learn to encode task-relevant semantic structures within distinct frequency sub-bands.
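The abstract gives no implementation details, so the following is only an illustrative sketch of the general idea of filtering an image within distinct frequency sub-bands. It is not the paper's method. The function names (`radial_band_masks`, `subband_decompose`), the concentric-ring band layout, and the use of one complex weight per band are all assumptions made for illustration. In a trained model the weights would be learned from data; here they are fixed by hand.

```python
import numpy as np


def radial_band_masks(height, width, num_bands):
    """Partition the 2D frequency plane into concentric rings (hypothetical band layout)."""
    fy = np.fft.fftfreq(height)[:, None]  # vertical frequencies in cycles per pixel
    fx = np.fft.fftfreq(width)[None, :]   # horizontal frequencies in cycles per pixel
    radius = np.sqrt(fx**2 + fy**2)
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    # Assign every frequency bin to exactly one ring using the inner edges only.
    band_index = np.clip(np.digitize(radius, edges[1:-1]), 0, num_bands - 1)
    return np.stack([band_index == b for b in range(num_bands)])


def subband_decompose(image, band_weights):
    """Return one complex-valued image per frequency sub-band.

    band_weights holds one complex scalar per band. It stands in for the
    learned spectral filters; a real model would learn richer filters.
    """
    band_weights = np.asarray(band_weights, dtype=complex)
    spectrum = np.fft.fft2(image)
    masks = radial_band_masks(image.shape[0], image.shape[1], len(band_weights))
    filtered = masks * band_weights[:, None, None] * spectrum
    # Inverse transform each band back to the spatial domain.
    return np.fft.ifft2(filtered, axes=(-2, -1))


rng = np.random.default_rng(0)
image = rng.random((8, 8))
bands = subband_decompose(image, np.ones(4, dtype=complex))
print(bands.shape)
```

Because the ring masks partition the frequency plane, summing the sub-band images with unit weights recovers the original image. Non-unit complex weights rescale and phase-shift each band independently, which is the knob a data-driven filter would adjust.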
Salience analyses reveal that our psychovisual models extract highly interpretable object parts compared to the amorphous regions produced by regular Convolutional Neural Networks (CNNs). Furthermore, we find that our models are less depth dependent than CNNs for model scaling, since our complex-valued representations and learned abstractions subsume the role of the deep spatial layers.
Together, these findings demonstrate that psychovisual coding provides a promising path toward more efficient and transparent vision models.