MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

arXiv cs.CV·Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen


~1 min · 6/8/2026 · en

Quick Take

MotionEnhancer improves the motion understanding of Vision-Language Models (VLMs) by using motion priors from a Video Diffusion Model (VDM) as auxiliary supervision, aligning the VLM's attention with the VDM's. It adds no trainable parameters and requires no architecture changes. Two parameter-free modules extract and optimize motion-related attention from the VDM, and the method improves consistently over state-of-the-art VLMs on two motion-level benchmarks.

Key Points

MotionEnhancer utilizes motion priors from Video Diffusion Models for VLM enhancement.
It features two parameter-free modules: Motion-sensitive Head Selection and Motion-salient Text Token Identification.
The method achieves consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks.
No additional training parameters or architecture modifications are required.
The gains are largest on motion-related metrics.

Article Content

From source RSS / original summary

arXiv:2606.06853v1 (announce type: new)

Abstract: The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic.

In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment.
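The abstract does not say how the attention alignment is computed. As a rough illustration only, the sketch below treats one VDM attention map as a target distribution and measures how far a VLM attention map is from it using KL divergence. The choice of KL divergence, the function names, and the toy values are all assumptions, not the paper's method.

```python
import math

def normalize(weights, eps=1e-8):
    """Rescale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    return [(w + eps) / (total + eps * len(weights)) for w in weights]

def attention_alignment_loss(vlm_attn, vdm_attn):
    """KL(vdm || vlm) for one attention map (hypothetical loss choice)."""
    p = normalize(vdm_attn)  # target: motion prior taken from the VDM
    q = normalize(vlm_attn)  # prediction: attention produced by the VLM
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy attention over three video regions (made-up numbers).
vdm = [0.1, 0.7, 0.2]
vlm_far = [0.6, 0.2, 0.2]     # attends to the wrong region
vlm_near = [0.15, 0.65, 0.2]  # close to the VDM pattern

loss_far = attention_alignment_loss(vlm_far, vdm)
loss_near = attention_alignment_loss(vlm_near, vdm)
```

In this toy case `loss_near` is smaller than `loss_far`, so minimizing such a term would pull the VLM's attention toward the regions the VDM highlights.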

MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or .
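The summary names MHS and MTTI but does not give their selection criteria. One way to picture a parameter-free, computation-only selection step is to score each attention head (or text token) by how much its attention changes between frames and keep the top scorers. The temporal-variation score, the function names, and the toy data below are assumptions made for illustration, not the authors' definitions.

```python
def temporal_variation(frames):
    """Mean absolute change between consecutive per-frame attention maps."""
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0

def select_top_k(attn_by_key, k):
    """Return the k keys (head ids or tokens) with the highest temporal variation."""
    scores = {key: temporal_variation(frames) for key, frames in attn_by_key.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical per-frame attention maps for two heads and two text tokens.
heads = {
    "head_0": [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # static across frames
    "head_1": [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]],  # shifts over time
}
tokens = {
    "dog": [[0.6, 0.4], [0.6, 0.4]],
    "jumps": [[0.8, 0.2], [0.2, 0.8]],
}

print(select_top_k(heads, 1))   # ['head_1']
print(select_top_k(tokens, 1))  # ['jumps']
```

Nothing here is learned: the score is computed directly from existing attention maps, which is the sense in which a module can be parameter-free.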

Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.

