Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity
Quick Take
The MindDiffuser framework reconstructs images from brain activity by combining semantic and structural guidance. Stage 1 feeds CLIP text embeddings decoded from brain responses into Stable Diffusion; Stage 2 refines the result so that it keeps fine-grained structure such as position, orientation, and size. According to the authors, experiments on fMRI, EEG, and MEG datasets show that it significantly improves on the performance of previous state-of-the-art models.
Key Points
- MindDiffuser employs a two-stage framework for improved image reconstruction.
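- Stage 1 feeds CLIP text embeddings decoded from brain responses into Stable Diffusion to generate a preliminary image that carries the semantic content.
- Stage 2 uses decoded shallow CLIP visual features as supervisory signals, iteratively refining the Stage 1 feature vectors via backpropagation to align structural information.
- Experiments across three modalities (fMRI, EEG, MEG) show that the framework significantly enhances the performance of previous state-of-the-art models.
- Spatial and temporal visualization results support the framework's neurobiological plausibility and offer guidance for future neural decoding work.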
Article Content
From source RSS / original summary
arXiv:2606.00121v1
Abstract: Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. In particular, achieving precise and controllable image reconstruction is of great significance for advancing the development and use of brain-computer interfaces.
Recent methods, leveraging advances in the power of text-to-image generation models, have reconstructed images that closely approximate complex natural stimuli in terms of semantics (e.g., concepts and objects). However, they struggle to maintain consistency with the original stimuli in fine-grained structural information (e.g., position, orientation and size), which undermines both the controllability and interpretability of the models.
To address the aforementioned issues, we propose a two-stage image reconstruction framework, termed MindDiffuser. In Stage 1, Contrastive Language-Image Pretraining (CLIP) text embeddings decoded from brain responses are input into Stable Diffusion, generating a preliminary image containing semantic information. In Stage 2, we use decoded shallow CLIP visual features as supervisory signals, iteratively refining the feature vectors from Stage 1 via backpropagation to align structural information.
We conducted extensive experiments on brain response datasets across three modalities (fMRI, EEG, MEG) elicited by visual stimuli, demonstrating that our framework significantly enhances the performance of previous state-of-the-art models, highlighting the effectiveness and versatility of our approach. Spatial and temporal visualization results further support the neurobiological plausibility of our framework, providing guidance for future neural decoding efforts across different brain signal modalities.
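The Stage 2 idea described in the abstract (treat decoded shallow features as a target and refine the Stage 1 feature vector by backpropagation) can be illustrated with a toy sketch. Everything here is a stand-in, not the authors' implementation: the hypothetical `features` function replaces the real path from feature vector through Stable Diffusion to shallow CLIP visual features with a single `tanh` layer, the target is synthetic rather than decoded from brain data, and the learning rate and step count are arbitrary.

```python
import math
import random


def features(W, z):
    """Toy stand-in for shallow visual features of the generated image: tanh(W @ z)."""
    return [math.tanh(sum(w * x for w, x in zip(row, z))) for row in W]


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def refine(W, z_init, target, lr=0.05, steps=1000):
    """Gradient descent on z so that features(W, z) moves toward target."""
    z = list(z_init)
    m = len(target)
    for _ in range(steps):
        f = features(W, z)
        # Backward pass: derivative of the MSE through tanh ...
        delta = [2.0 * (fi - ti) / m * (1.0 - fi * fi) for fi, ti in zip(f, target)]
        # ... and then through the linear map W (this is W^T @ delta).
        grad = [sum(W[i][j] * delta[i] for i in range(m)) for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


random.seed(0)
dim_z, dim_f = 4, 6  # assumed toy sizes
W = [[random.uniform(-1, 1) for _ in range(dim_z)] for _ in range(dim_f)]

z_true = [random.uniform(-1, 1) for _ in range(dim_z)]  # latent of the "seen" stimulus
target = features(W, z_true)    # stands in for shallow features decoded from brain data
z_stage1 = [0.0] * dim_z        # stands in for the Stage 1 feature vector

z_refined = refine(W, z_stage1, target)

loss_before = mse(features(W, z_stage1), target)
loss_after = mse(features(W, z_refined), target)
print(f"feature loss before refinement: {loss_before:.6f}")
print(f"feature loss after refinement:  {loss_after:.6f}")
```

In a real pipeline the hand-written backward pass would be replaced by automatic differentiation through the generative model and the feature extractor; the toy only shows the shape of the loop: forward pass, feature loss against the decoded target, gradient step on the feature vector.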