VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference
Quick Answer
VigilFormer introduces a novel framework for video anomaly detection, utilizing Deformable Spatio-Temporal Attention and Causal Risk Inference.
Quick Take
VigilFormer reports AUC scores of 87.83%, 97.21%, and 89.74% on UCF-Crime, ShanghaiTech, and CUHK Avenue, respectively, while running at 41.5 FPS on a single GPU. According to the authors, this outperforms recent weakly-supervised methods in both accuracy and speed.
Key Points
- A Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention.
- A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and is trained with a contrastive multiple-instance learning objective, so no frame-level labels are needed.
- An Adaptive Confidence Scheduler (ACS) skips low-information frames at inference time, reducing redundant computation in static scenes.
- Reported AUC: 87.83% on UCF-Crime, 97.21% on ShanghaiTech, and 89.74% on CUHK Avenue, at 41.5 FPS on a single GPU.
- The authors report better accuracy and speed than recent weakly-supervised methods.
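To make the sparse-sampling idea behind the encoder concrete, here is a minimal one-dimensional sketch of deformable attention. It is not the paper's DSTE: the function names are hypothetical, and the offsets and attention logits are passed in by hand, whereas a real deformable attention layer would predict them from the query with learned projections. The point is only that each query aggregates K sampled locations instead of all N positions.

```python
import math
from typing import List, Sequence


def sample_linear(values: Sequence[float], pos: float) -> float:
    """Linearly interpolate `values` at a fractional position, clamped to the valid range."""
    pos = min(max(pos, 0.0), float(len(values) - 1))
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(values) - 1)
    frac = pos - lo
    return (1.0 - frac) * values[lo] + frac * values[hi]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attend(
    values: Sequence[float],
    query_pos: float,
    offsets: Sequence[float],  # assumption: supplied by hand, not learned
    logits: Sequence[float],   # assumption: supplied by hand, not learned
) -> float:
    """Aggregate K sampled values around `query_pos` instead of attending to every position."""
    weights = softmax(logits)
    return sum(
        w * sample_linear(values, query_pos + off)
        for w, off in zip(weights, offsets)
    )


if __name__ == "__main__":
    feature_track = [0.0, 10.0, 20.0, 30.0]
    # Two sampling points around position 1, with equal attention logits.
    print(deformable_attend(feature_track, 1.0, [-0.5, 1.5], [0.0, 0.0]))
```

With K sampling points per query, the cost per query is proportional to K rather than to the sequence length, which is where the saving over dense attention comes from.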
Article Content
From source RSS / original summary
arXiv:2606.14724v1. Announce Type: new.
Abstract: Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video.
The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels.
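The two ingredients of the classifier can be illustrated with a small sketch. The abstract does not give the CAC architecture or the exact form of its contrastive objective, so everything below is an assumption: the convolution works on scalar snippet features, and a generic top-k hinge ranking loss, a common choice in weakly-supervised anomaly detection, stands in for the paper's contrastive multiple-instance loss. All function names are hypothetical.

```python
from typing import List, Sequence


def dilated_causal_conv1d(
    x: Sequence[float], weights: Sequence[float], dilation: int = 1
) -> List[float]:
    """Causal 1D convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ...

    weights[0] multiplies the current step and weights[k] the step k*dilation
    in the past. Positions before the start of the sequence count as zero.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # never look before the start, never look ahead
                acc += w * x[idx]
        out.append(acc)
    return out


def topk_mean(scores: Sequence[float], k: int) -> float:
    """Mean of the k highest snippet scores in a bag (one bag per video)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


def mil_ranking_loss(
    anomalous_bag: Sequence[float],
    normal_bag: Sequence[float],
    k: int = 1,
    margin: float = 1.0,
) -> float:
    """Hinge loss pushing the top-k scores of an anomalous video above those of a normal one.

    Only video-level labels are used: the bag is known to be anomalous or
    normal, but no individual snippet is labeled.
    """
    return max(0.0, margin - topk_mean(anomalous_bag, k) + topk_mean(normal_bag, k))


if __name__ == "__main__":
    snippets = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    # The impulse at t=2 reappears at t=4 (two steps later), never earlier.
    print(dilated_causal_conv1d(snippets, weights=[1.0, 0.5], dilation=2))
    print(mil_ranking_loss([0.1, 0.9, 0.2], [0.2, 0.3, 0.1]))
```

Stacking such layers with growing dilation widens the temporal context without letting any output depend on future snippets, which is what makes the classifier usable on a live stream.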
To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74%, respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.
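The abstract does not say how the ACS decides that a frame is low-information, so the following is only a sketch of the general frame-skipping pattern under a stated assumption: a frame is skipped when its mean absolute pixel difference from the last evaluated frame falls below a threshold, and the previous score is reused. The names `score_with_skipping`, `score_fn`, and `threshold` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[float]  # a frame flattened to pixel intensities (illustrative)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def score_with_skipping(
    frames: Sequence[Frame],
    score_fn: Callable[[Frame], float],  # stand-in for the expensive detector
    threshold: float = 0.05,             # assumed skip criterion, not from the paper
) -> Tuple[List[float], int]:
    """Return one score per frame and the number of frames actually evaluated."""
    scores: List[float] = []
    evaluated = 0
    last_frame: Optional[Frame] = None
    last_score = 0.0
    for frame in frames:
        changed = last_frame is None or mean_abs_diff(frame, last_frame) >= threshold
        if changed:
            # Run the detector only when the scene differs from the reference frame.
            last_score = score_fn(frame)
            last_frame = frame
            evaluated += 1
        # Skipped frames inherit the most recent score.
        scores.append(last_score)
    return scores, evaluated


if __name__ == "__main__":
    video = [[0.0, 0.0], [0.0, 0.01], [1.0, 1.0], [1.0, 1.0]]
    scores, evaluated = score_with_skipping(video, lambda f: sum(f) / len(f))
    print(scores, evaluated)
```

Comparing against the last evaluated frame, rather than the immediately preceding one, keeps slow drift from being skipped indefinitely: small changes accumulate until they cross the threshold.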


