COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection
Quick Take
COPRA introduces a dynamic parameter adaptation framework for improved video anomaly detection using vision-language models.
Key Points
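- Addresses the training-inference mismatch in VLM-based VAD methods, in both data distribution and model configuration.
- Generates input-specific parameter updates that adapt a frozen VLM for each video segment.
- Outperforms static baselines on standard VAD benchmarks in both in-domain and cross-domain settings.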
Abstract: Vision-language models (VLMs) have shown strong performance in video anomaly detection (VAD) while providing interpretable predictions. However, existing VLM-based VAD methods suffer from a fundamental mismatch between training and inference in both data distribution and model configuration. First, most approaches rely on static post-training adaptation, limiting generalization under distribution shifts such as unseen environments or anomaly types. Second, they train VLMs on sparse frames from long videos, but perform inference on densely sampled short segments, creating inconsistencies between training and testing. To address these limitations, we propose COPRA, a conditional parameter adaptation framework for VLM-based VAD. Instead of fixed prompts or shared parameter updates, COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM for each video segment during both training and inference. Experiments show strong performance on standard VAD benchmarks, consistently outperforming static baselines in both in-domain and cross-domain settings. Moreover, COPRA generalizes beyond VAD to unseen tasks such as multiple-choice Video Question Answering and Dense Captioning. These results highlight COPRA as an effective weight-space generation framework for scalable, adaptive, and context-aware video understanding. The code will be released at this https URL
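The abstract does not spell out how the input-specific updates are produced, so the following is only an illustrative sketch of the general idea of conditional parameter adaptation, not the authors' implementation. It assumes a low-rank update generated by a small linear hypernetwork from a per-segment embedding; the class name `ConditionalLowRankAdapter`, the low-rank form, and all dimensions are hypothetical choices made for this example. The reinforcement learning component named in the title is not modeled here.

```python
import numpy as np


class ConditionalLowRankAdapter:
    """Hypothetical sketch: map a segment embedding to a low-rank
    update for one frozen weight matrix (assumed design, not COPRA's)."""

    def __init__(self, d_in, d_out, d_cond, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # Frozen base weight, standing in for one linear layer of the VLM.
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Hypernetwork heads (the only parts that would be trained).
        self.H_a = rng.standard_normal((rank * d_in, d_cond)) * 0.01
        self.H_b = rng.standard_normal((d_out * rank, d_cond)) * 0.01

    def delta(self, cond):
        """Generate the input-specific update B @ A from the condition vector."""
        A = (self.H_a @ cond).reshape(self.rank, self.d_in)
        B = (self.H_b @ cond).reshape(self.d_out, self.rank)
        return B @ A

    def forward(self, x, cond):
        """Apply the frozen weight plus the generated update to input x."""
        return (self.W + self.delta(cond)) @ x


# Two different segment embeddings produce two different effective weights,
# while the base weight itself is never modified.
rng = np.random.default_rng(1)
adapter = ConditionalLowRankAdapter(d_in=16, d_out=12, d_cond=8, rank=4)
cond_a = rng.standard_normal(8)   # embedding of segment A (made-up data)
cond_b = rng.standard_normal(8)   # embedding of segment B (made-up data)
x = rng.standard_normal(16)       # one input feature vector

W_before = adapter.W.copy()
y_a = adapter.forward(x, cond_a)
y_b = adapter.forward(x, cond_b)
```

The contrast with a static adapter is that a static method would learn one fixed update shared by all inputs, whereas here the update is recomputed for every segment, which is the property the abstract attributes to COPRA.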
| Comments: | Manuscript currently under review for publication |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.15325 [cs.CV] (or arXiv:2605.15325v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.15325 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Pan He
[v1] Thu, 14 May 2026 18:39:40 UTC (14,154 KB)
Originally published at arxiv.org