CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection
Quick Take
CoReVAD is a training-free framework for video anomaly detection that runs on a single frozen Vision-Language Model (VLM) and achieves competitive results among training-free methods on the UCF-Crime and XD-Violence datasets. It generates anomaly scores and temporal descriptions directly from the VLM, giving interpretable explanations, and mitigates noise in the generated outputs through a Local Response Cleaning (LRC) module. Because it needs neither task-specific training nor an external Large Language Model (LLM), it avoids the training costs and inference overhead those add.
Key Points
- CoReVAD operates without task-specific training, reducing domain dependency.
- The framework utilizes a single frozen Vision-Language Model for anomaly detection.
- The Local Response Cleaning (LRC) module mitigates noise in generated outputs using local vision-text alignment.
- Achieves competitive performance among training-free methods on the UCF-Crime and XD-Violence benchmarks.
- Official code is available on GitHub for further exploration.
Article Content
From source RSS / original summary
arXiv:2605.23116v1 Announce Type: new
Abstract: Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning.
However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM.
To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk-00/CoReVAD
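The abstract names three temporal refinement steps (softmax-based refinement, Gaussian smoothing, and position weighting) but does not give their formulas. The following is a minimal Python sketch of how such a post-processing chain over per-segment anomaly scores could look. It is not the authors' implementation: the function names, the `temperature`, `sigma`, and `strength` parameters, the order of the steps, and the linear position ramp are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax_refine(scores: Sequence[float], temperature: float = 0.1) -> List[float]:
    """Assumed refinement: scale each score by its softmax weight relative to the peak.

    exp((s - peak) / T) equals the softmax weight of s divided by the largest
    softmax weight, so the highest-scoring segment keeps its score unchanged
    and lower-scoring segments are suppressed.
    """
    peak = max(scores)
    return [s * math.exp((s - peak) / temperature) for s in scores]


def gaussian_smooth(scores: Sequence[float], sigma: float = 1.0) -> List[float]:
    """Smooth scores along time with a normalized Gaussian kernel (edges replicated)."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(kernel)
    n = len(scores)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for offset, weight in zip(offsets, kernel):
            j = min(max(i + offset, 0), n - 1)  # clamp index to replicate edge values
            acc += weight * scores[j]
        smoothed.append(acc / total)
    return smoothed


def position_weights(n: int, strength: float = 0.2) -> List[float]:
    """Hypothetical linear ramp from (1 - strength) at the first segment to 1 at the last."""
    if n == 1:
        return [1.0]
    return [1.0 - strength + strength * i / (n - 1) for i in range(n)]


def refine_scores(scores: Sequence[float]) -> List[float]:
    """Chain the three steps; the ordering here is an assumption, not taken from the paper."""
    refined = softmax_refine(scores)
    smoothed = gaussian_smooth(refined)
    weights = position_weights(len(smoothed))
    return [min(1.0, max(0.0, s * w)) for s, w in zip(smoothed, weights)]


if __name__ == "__main__":
    raw = [0.1, 0.1, 0.9, 0.8, 0.1, 0.1]  # made-up per-segment scores in [0, 1]
    print([round(v, 3) for v in refine_scores(raw)])
```

In this sketch the softmax step sharpens the contrast between the most suspicious segments and the rest, the Gaussian step spreads each score over its temporal neighbors so isolated spikes are damped, and the position ramp stands in for whatever progression-aware weighting the paper actually uses. The official repository linked above is the place to check the real definitions.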