Lightweight Complementary-Cue Fusion for Robust Video Face Forgery Detection
Quick Take
The study presents LFWS and LFWL, lightweight detectors built on an Xception baseline that raise average AUC for video face forgery detection from 74.8% to 78.6% on FaceForensics++ and from 70.5% to 74.9% on DFDC-Preview while adding only 292 parameters. At 21.9 million parameters in total, they are smaller than F3Net (22.5 million) and SRM (55.3 million) yet outperform both, which the authors take as a reason to reevaluate scale-driven design choices in this field.
Key Points
- LFWS and LFWL fuse two handcrafted cues through a lightweight 1x1 convolution.
- Average AUC improves by 3.8 percentage points on FaceForensics++ and 4.4 percentage points on DFDC-Preview.
- Total model size remains at 21.9 million parameters, smaller than F3Net.
- Outperform F3Net, SRM, and SPSL across eight public benchmarks.
- Suggests a shift in design philosophy for face forgery detection.
Article Content
From source RSS / original summary: arXiv:2605.29092v1 (Announce Type: new)
Abstract: Current face video forgery detectors use wide or dual-stream backbones. We show that a single, lightweight fusion of two handcrafted cues can achieve higher accuracy with a much smaller model. Based on the Xception baseline model (21.9 million parameters), we build two detectors: LFWS, which adds a 1x1 convolution to combine a low-frequency Wavelet-Denoised Feature (WDF) with a phase-spectrum channel derived from Spatial-Phase Shallow Learning (SPSL), and LFWL, which merges WDF with Local Binary Patterns (LBP) in the same way. This extra module adds only 292 parameters, keeping the total at 21.9 million, smaller than F3Net (22.5 million) and less than half the size of SRM (55.3 million).
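As a rough illustration of what a 1x1 convolution fusion does, the sketch below mixes the channels of two cue maps at each pixel using a weight matrix and a bias. The function names (`fuse_1x1`, `conv1x1_param_count`), the channel counts, and the weights are assumptions made for illustration. The abstract does not specify the layer's configuration, so this is not the authors' implementation and it does not try to reproduce the 292-parameter figure.

```python
def conv1x1_param_count(in_channels, out_channels, bias=True):
    """Number of learnable parameters in a 1x1 convolution."""
    return in_channels * out_channels + (out_channels if bias else 0)


def fuse_1x1(cue_a, cue_b, weights, bias):
    """Fuse two cue maps with a 1x1 convolution (per-pixel channel mixing).

    cue_a, cue_b: lists of channels, each channel an H x W list of lists.
    weights: out_channels x in_channels matrix; bias: out_channels values.
    """
    channels = cue_a + cue_b  # concatenate along the channel axis
    height, width = len(channels[0]), len(channels[0][0])
    fused = []
    for w_row, b in zip(weights, bias):
        out = [[b + sum(w * ch[y][x] for w, ch in zip(w_row, channels))
                for x in range(width)]
               for y in range(height)]
        fused.append(out)
    return fused


# Hypothetical 2x2 single-channel cues (stand-ins for WDF and a phase map).
wdf = [[[1.0, 2.0], [3.0, 4.0]]]
phase = [[[0.5, 0.5], [0.5, 0.5]]]
fused = fuse_1x1(wdf, phase, weights=[[1.0, 2.0]], bias=[0.0])
```

Because the kernel is 1x1, the parameter count depends only on the number of input and output channels, not on the image resolution, which is why such a block can stay tiny next to a 21.9-million-parameter backbone.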
Even with this minimal overhead, the fused models increase the average area under the curve (AUC) from 74.8% to 78.6% on FaceForensics++ and from 70.5% to 74.9% on DFDC-Preview, gains of 3.8 and 4.4 percentage points over the Xception baseline. They also consistently outperform F3Net, SRM, and SPSL on eight public benchmarks, without extra data or test-time augmentation.
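The reported figures can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the abstract; the variable and function names are illustrative, and the model label "fused Xception" is shorthand introduced here for LFWS/LFWL.

```python
# Figures quoted in the abstract: (baseline AUC %, fused AUC %).
auc = {
    "FaceForensics++": (74.8, 78.6),
    "DFDC-Preview": (70.5, 74.9),
}
# Total parameter counts in millions.
params_millions = {"fused Xception": 21.9, "F3Net": 22.5, "SRM": 55.3}


def auc_gain_points(before, after):
    """Absolute AUC gain in percentage points, rounded to one decimal."""
    return round(after - before, 1)


gains = {name: auc_gain_points(b, a) for name, (b, a) in auc.items()}

# Size of the fused model relative to SRM (a value below 0.5 means "less than half").
size_vs_srm = round(params_millions["fused Xception"] / params_millions["SRM"], 3)

# Share of the total parameter budget taken by the 292-parameter fusion module.
extra_fraction = 292 / 21.9e6
```

The gains are absolute differences in AUC (percentage points), not relative improvements, and the fusion module accounts for roughly one thousandth of one percent of the total parameter count.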
These results show that carefully paired, handcrafted features, combined through the lightweight fusion block, can provide competitive robustness at a significantly lower cost than comparable frequency-based detectors. Our findings suggest a need to reevaluate scale-driven design choices in face video forgery detection.
