Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection
Quick Take
Segmentation-guided spatial indexing selects semantically meaningful facial patch tokens before pooling, and its mouth-indexed linear probe reaches an AUC of 0.905 on Celeb-DF v2. It outperforms LipForensics and Xception with no DINOv3 or FaRL fine-tuning and no target-domain data. The ablations attribute the result to two ingredients acting together: DINOv3's patch-level spatial consistency and region-restricted pooling.
Key Points
- The mouth-indexed probe achieves an AUC of 0.905 on Celeb-DF v2, outperforming LipForensics by 8.1 pp and Xception by 16.9 pp.
- A frozen FaRL parser assigns a semantic label to each DINOv3 ViT-L/16 patch token.
- Method discards non-target tokens, focusing on relevant facial regions.
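- Replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp.
- Both the DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.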
Article Content
From the source RSS (original summary): arXiv:2606.00098v1, announce type: new.
Abstract: We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region.
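The select-then-pool order described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `region_pool` and `probe_score`, the label id used for "mouth", the use of mean pooling, and the toy random features are all assumptions made for illustration. It also assumes the parser's labels have already been aligned to the patch grid, a step the abstract does not detail.

```python
import numpy as np

def region_pool(patch_tokens, patch_labels, target_labels):
    """Mean-pool only the patch tokens whose label is in target_labels.

    patch_tokens:  (N, D) array, one row per patch token.
    patch_labels:  (N,) integer array, one semantic label per patch.
    target_labels: iterable of label ids to keep (hypothetical ids).
    """
    mask = np.isin(patch_labels, list(target_labels))
    if not mask.any():
        raise ValueError("no patch carries a target label")
    # Non-target tokens never reach the pooled feature.
    return patch_tokens[mask].mean(axis=0)

def probe_score(feature, weights, bias):
    """Linear probe followed by a sigmoid: a score in (0, 1) for 'fake'."""
    return 1.0 / (1.0 + np.exp(-(feature @ weights + bias)))

# Toy data: 6 patches with 4-dim features; label 1 stands in for "mouth".
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
labels = np.array([0, 1, 1, 2, 0, 1])
MOUTH = {1}  # hypothetical label id, not taken from FaRL

pooled = region_pool(tokens, labels, MOUTH)
score = probe_score(pooled, weights=np.ones(4), bias=0.0)
```

Because the mask is applied before pooling, the probe's input depends only on the selected rows, which is the sense in which the abstract calls region attribution structural.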
This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data.
Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.
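The abstract reports the comparisons only as gaps in percentage points. The short calculation below converts them to implied AUC values, assuming "pp" means absolute points on the AUC scale (1 pp = 0.01 AUC). The resulting numbers are derived from the stated gaps and are not quoted from the paper.

```python
FULL_AUC = 0.905  # mouth-indexed probe on Celeb-DF v2, as stated

# Gaps in percentage points, as stated in the abstract.
gaps_pp = {
    "LipForensics": 8.1,
    "Xception": 16.9,
    "DINOv3 CLS token (ablation)": 26.4,
    "FaRL features (ablation)": 20.9,
}

# Assumption: 1 pp corresponds to 0.01 on the AUC scale.
implied_auc = {name: round(FULL_AUC - gap / 100, 3) for name, gap in gaps_pp.items()}

for name, auc in implied_auc.items():
    print(f"{name}: {auc:.3f}")
```

Under that reading, both ablations land well below the LipForensics baseline, which is consistent with the claim that neither component alone approaches the full system.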
