CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection
Quick Take
CoCoVideo-26K is a benchmark dataset for detecting AI-generated video forgery. It covers 13 commercial generators and provides semantically aligned real-fake video pairs. The accompanying CoCoDetect framework combines contrastive learning with confidence-gated multimodal large language model (MLLM) inference and reports state-of-the-art performance on high-fidelity AIGC videos, a setting where existing detection methods struggle.
Key Points
- CoCoVideo-26K provides contrastive real-fake video pairs drawn from 13 commercial AIGC systems.
- The paired design supports closer study of the differences between authentic and high-quality synthetic videos.
- CoCoDetect integrates an R3D-18 backbone for spatio-temporal representation extraction.
- Experiments on CoCoVideo-26K and public benchmarks report state-of-the-art performance.
- Code and dataset are publicly available on GitHub.
Article Content
arXiv:2606.00101v1. Abstract: With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below that of commercial AIGC systems.
Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs.
This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference.
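The abstract does not say which contrastive objective CoCoDetect uses. As a rough illustration of how aligned real-fake pairs could drive contrastive training, the sketch below implements the classic pairwise margin loss on two embedding vectors. The function names, the margin value, and the choice of this particular loss are assumptions made for illustration, not the authors' method.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(
    a: Sequence[float],
    b: Sequence[float],
    same_class: bool,
    margin: float = 1.0,  # hypothetical value, chosen only for this example
) -> float:
    """Pairwise margin loss (an assumed objective, not taken from the paper).

    Same-class pairs are pulled together (loss = d^2); different-class
    pairs, such as a real clip and its generated counterpart, are pushed
    apart until they are at least `margin` away (loss = max(0, margin - d)^2).
    """
    d = euclidean(a, b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Under this assumed loss, a real clip and its semantically aligned fake would be passed with `same_class=False`, so the pair contributes to the loss only while the two embeddings are closer than the margin.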
An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework's robustness and generalizability. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.
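The confidence-gated routing described above can be sketched as follows. This is a minimal illustration assuming the backbone emits a single probability that a clip is fake. The `Verdict` type, the `gated_predict` function, the `mllm_judge` callback, and the 0.9 threshold are all hypothetical names and values, not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    label: str         # "real" or "fake"
    confidence: float  # backbone confidence in its own top class
    source: str        # "backbone" or "mllm"


def gated_predict(
    p_fake: float,
    mllm_judge: Callable[[], str],
    threshold: float = 0.9,  # hypothetical gate value
) -> Verdict:
    """Accept the backbone's call when it is confident; otherwise defer.

    `p_fake` stands in for the backbone's fake-class probability and
    `mllm_judge` for a slower MLLM call that returns "real" or "fake".
    """
    if not 0.0 <= p_fake <= 1.0:
        raise ValueError("p_fake must be a probability in [0, 1]")
    confidence = max(p_fake, 1.0 - p_fake)
    if confidence >= threshold:
        label = "fake" if p_fake >= 0.5 else "real"
        return Verdict(label, confidence, "backbone")
    # Uncertain case: hand the decision to the MLLM.
    return Verdict(mllm_judge(), confidence, "mllm")
```

Passing the MLLM as a callback means the expensive call runs only for clips that fall below the gate, which is the practical point of routing by confidence.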