MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
Quick Answer
MotionEnhancer improves the motion understanding of Vision-Language Models (VLMs) by aligning their attention with motion priors from Video Diffusion Models (VDMs), without adding training parameters or modifying the architecture.
Quick Take
MotionEnhancer uses motion priors distilled from a video diffusion model as auxiliary supervision, aligning the VLM's attention with them. Two parameter-free modules extract and optimize motion-related attention from the VDM, and the method improves on state-of-the-art VLMs on two motion-level benchmarks.
Key Points
- MotionEnhancer utilizes motion priors from Video Diffusion Models for VLM enhancement.
- It features two parameter-free modules: Motion-sensitive Head Selection and Motion-salient Text Token Identification.
- The method achieves consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks.
- No additional training parameters or architecture modifications are required.
- The gains are most pronounced on motion-related metrics.
Article Content
From source RSS / original summary: arXiv:2606.06853v1. Abstract: The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic.
In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment.
MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or .
Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.
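The abstract does not spell out how the motion-related attention is selected or aligned, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a head (or text token) is "motion-sensitive" when its attention changes a lot from frame to frame, and that alignment is measured with a KL divergence between VDM and VLM attention. The scoring rule, the KL loss, and every function name here are hypothetical illustrations, not the paper's actual MHS or MTTI definitions.

```python
import math

def temporal_variation(frames):
    """Mean absolute change between consecutive frames' attention vectors.
    Assumed proxy for motion sensitivity (not taken from the paper)."""
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs)

def select_most_dynamic(maps, k):
    """Indices of the k attention maps (heads or text tokens) with the
    largest temporal variation. Pure computation, no learned parameters."""
    scores = [temporal_variation(m) for m in maps]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def normalize(weights, eps=1e-8):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights) + eps * len(weights)
    return [(w + eps) / total for w in weights]

def kl_divergence(target, pred):
    """KL(target || pred) between two attention vectors."""
    p, q = normalize(target), normalize(pred)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def alignment_loss(vdm_attn, vlm_attn, k):
    """Average per-frame KL between the mean of the k selected VDM heads
    (the target) and the VLM attention for the same frame.
    vdm_attn is indexed [head][frame][position]; vlm_attn is [frame][position]."""
    heads = select_most_dynamic(vdm_attn, k)
    loss = 0.0
    for t, vlm_frame in enumerate(vlm_attn):
        target = [
            sum(vdm_attn[h][t][i] for h in heads) / len(heads)
            for i in range(len(vlm_frame))
        ]
        loss += kl_divergence(target, vlm_frame)
    return loss / len(vlm_attn)

# Toy data: three frames, two spatial positions.
static_head = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # attention never moves
moving_head = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]  # attention shifts over time
vdm_attention = [static_head, moving_head]

selected = select_most_dynamic(vdm_attention, k=1)
loss_tracking = alignment_loss(vdm_attention, moving_head, k=1)
loss_static = alignment_loss(vdm_attention, static_head, k=1)
print(selected, loss_tracking, loss_static)
```

In this toy setup the head whose attention shifts across frames is the one selected, and a VLM attention pattern that follows the same shift gets a lower loss than one that stays fixed. The same scoring function could be applied over text tokens instead of heads. In a real training setup the loss would be computed inside an automatic differentiation framework so that it can act as auxiliary supervision.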