CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

arXiv cs.CV·Adil Meric, Lin Geng Foo, Mert Kiray, Benjamin Busam, Rishabh Dabral, Christian Theobalt

5/25/2026

Quick Take

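CoMoGen is a controllable video generation framework that turns a single binary mask sequence and an input image into realistic interactive dynamics. It adds a lightweight MaskAdapter to an existing Multi Modal Diffusion Transformer (MMDiT) and fine-tunes only selected "Motion Layers" with LoRA, without changing the MMDiT architecture.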

Key Points

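Generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image.
Introduces a lightweight MaskAdapter that encodes the mask sequence into a latent residual signal, injected into MMDiT through a cosine-weighted schedule.
Identifies "Motion Layers" in the attention space of MMDiT and fine-tunes only those layers with LoRA, which reduces computational cost.
Reports state-of-the-art motion fidelity and perceptual realism compared with prior controllable video generation methods.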

Article Content

From source RSS / original summary

arXiv:2605.22996v1

Abstract: We present CoMoGen, a controllable video generation framework that generates realistic interactive dynamics from a single binary mask sequence conditioned on an input image. CoMoGen introduces a lightweight MaskAdapter that encodes binary mask sequences into a latent residual signal, injected into the Multi Modal Diffusion Transformer (MMDiT) model through a cosine-weighted schedule.
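The idea of injecting a residual through a cosine-weighted schedule can be illustrated with a small sketch. The abstract does not say what the schedule runs over (denoising steps, layers, or something else), nor its exact form, so everything here is an assumption: `cosine_weight`, `inject_residual`, the half-cosine decay from 1 to 0, and the toy three-element latent are hypothetical and are not taken from the paper.

```python
import math


def cosine_weight(index: int, count: int) -> float:
    """Half-cosine weight: 1.0 at index 0, falling to 0.0 at index count - 1.

    Assumption: the paper's schedule may differ in shape and direction.
    """
    if count < 2:
        return 1.0
    return 0.5 * (1.0 + math.cos(math.pi * index / (count - 1)))


def inject_residual(latent, residual, weight):
    """Add a weighted residual to a latent, element by element."""
    return [x + weight * r for x, r in zip(latent, residual)]


# Toy values standing in for a video latent and a mask-derived residual.
latent = [0.2, -0.1, 0.4]
mask_residual = [1.0, 0.0, -1.0]

for index in range(5):
    w = cosine_weight(index, 5)
    print(index, round(w, 3), inject_residual(latent, mask_residual, w))
```

With a schedule of this shape the mask signal has full strength at the start of the range and fades smoothly to nothing at the end, rather than switching off abruptly.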

Unlike the hierarchical coarse-to-fine design of UNet architectures, MMDiT operates as a sequence of uniform transformer blocks, making it difficult to identify which layers are responsible for motion generation. Therefore, we propose a novel way to determine "Motion Layers" operating in the attention space of MMDiT. We fine-tune the model by applying Low-Rank Adaptation (LoRA) to the Motion Layers, without requiring any architectural change to the MMDiT.
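Selective LoRA adaptation can be sketched in a few lines of plain Python. LoRA keeps a weight matrix W frozen and learns a low-rank update, giving an effective weight W + (alpha / rank) * B A. The abstract does not describe how the Motion Layers are found, so the layer indices below are hypothetical placeholders, as are the helper names `lora_weight` and `adapt_selected` and the tiny 2x2 matrices.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Effective weight W + (alpha / rank) * (B @ A); W itself is not modified."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def adapt_selected(weights, adapters, motion_layers, alpha, rank):
    """Apply LoRA only to the layers listed in motion_layers; others pass through."""
    adapted = []
    for i, w in enumerate(weights):
        if i in motion_layers:
            a, b = adapters[i]
            adapted.append(lora_weight(w, a, b, alpha, rank))
        else:
            adapted.append(w)
    return adapted


identity = [[1.0, 0.0], [0.0, 1.0]]
weights = [identity, identity, identity]   # three toy "layers"
motion_layers = {1}                         # hypothetical choice, not from the paper
# A has shape (rank, d_in), B has shape (d_out, rank); here rank = 1.
adapters = {1: ([[0.5, -0.5]], [[2.0], [0.0]])}

adapted = adapt_selected(weights, adapters, motion_layers, alpha=2.0, rank=1)
print(adapted[0])  # untouched layer
print(adapted[1])  # adapted layer
```

Only the small A and B matrices of the selected layers would be trained, which is where the reduced cost of selective adaptation comes from. In standard LoRA, B starts at zero so the adapted model initially matches the base model.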

This selective adaptation enables our method to focus on motion-critical components, yielding reduced computational cost. Despite its simplicity, CoMoGen enables precise subject motion and plausible interactions with surrounding humans, objects, and scenes. Comprehensive experiments on different datasets show that CoMoGen consistently outperforms prior controllable video generation methods and achieves state-of-the-art performance in motion fidelity and perceptual realism. Project page: mericadil.github.io/CoMoGen.


