Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models
Quick Take
The paper critiques current video anomaly detection methods for neglecting scene-specific normality modeling.
Key Points
- Focus on general models limits anomaly detection effectiveness.
- Current methods rely on weak supervision and MLLMs.
- Emphasizes need for single-scene, spatially-aware approaches.
Abstract
Recent video anomaly detection research has expanded rapidly with an emphasis on general models of normality intended to work across many different scenes. While this focus has led to improvements in scalability and multi-scene generalization, it has also shifted the field away from modeling the scene-specific and context-dependent nature of normal behavior. Contemporary approaches frequently rely on video-level weak supervision and opaque pretrained representations from multi-modal large language models (MLLMs), which encourage models to respond to familiar semantic anomaly categories rather than to deviations from the normal patterns of a particular environment. This trend suppresses spatial localization, introduces semantic bias, and reduces anomaly detection to a form of action recognition. In this paper, we examine whether these prevailing formulations align with the core requirements of real-world VAD, which is typically performed within a single scene where normality is determined by local geometry, semantics, and activity patterns. Through targeted visual analyses and empirical evaluations, we demonstrate the practical consequences of these limitations and show that meaningful progress in VAD requires renewed focus on single-scene, spatially-aware, and explainable formulations that capture the nuanced structure of normality within individual environments.
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.12725 [cs.CV] (arXiv:2605.12725v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.12725 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Michael Jones
[v1] Tue, 12 May 2026 20:29:49 UTC (3,473 KB)
— Originally published at arxiv.org