MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction
Quick Answer
MolmoMotion forecasts the future 3D trajectories of query points on an object, given a short visual history and a language instruction describing the goal. It significantly outperforms existing motion prediction baselines on PointMotionBench.
Quick Take
Trained on 3D point trajectories annotated from 1.16M videos, MolmoMotion predicts diverse motion patterns under different language instructions. Its learned 3D motion prior also improves training efficiency and generalization for robot manipulation.
Key Points
- MolmoMotion-1M corpus includes 1.16M videos with annotated 3D point trajectories.
- PointMotionBench features 111 object categories and 61 motion types for benchmarking.
- MolmoMotion model supports autoregressive prediction and flow-matching trajectory generation.
- The model significantly outperforms existing motion prediction methods on PointMotionBench.
- Learned 3D motion prior enhances training efficiency for robot manipulation tasks.
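The flow-matching trajectory generation mentioned in the key points can be illustrated with a toy sampler. This is a minimal sketch under stated assumptions, not the paper's implementation: the summary does not describe the network, its conditioning, or the trajectory parameterization. Here `euler_flow_sample` and `make_oracle_velocity` are hypothetical names, the trajectory is assumed to be an array of shape (T, N, 3), and a closed-form "oracle" velocity for a straight-line noise-to-data path stands in for a learned model.

```python
import numpy as np


def euler_flow_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


def make_oracle_velocity(target):
    """Stand-in for a learned velocity network (assumption, for illustration).

    For the straight-line path x_t = (1 - t) * x0 + t * target, the velocity
    that carries x_t to the target at t=1 is (target - x_t) / (1 - t).
    """
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity


# Toy "future trajectory": 8 time steps, 4 points, 3 coordinates each.
# Every point drifts along the x-axis; the values are made up.
target = np.zeros((8, 4, 3))
target[..., 0] = np.linspace(0.1, 0.8, 8)[:, None]

# Sampling starts from Gaussian noise and follows the velocity field.
rng = np.random.default_rng(0)
noise = rng.standard_normal(target.shape)
sample = euler_flow_sample(make_oracle_velocity(target), noise, num_steps=10)
```

With the oracle velocity, the sampler lands on the target trajectory. A trained model would replace the oracle with a network that sees the video history, query points, and instruction instead of the answer.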
Article Content
Original summary from the source RSS feed (arXiv:2606.18558v1).
Abstract: Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks.
We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point.
We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench.
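The task definition above can be made concrete with a small sketch of its inputs and outputs. Everything named here is an assumption for illustration: `ForecastQuery`, `static_forecast`, and `mean_displacement_error` are hypothetical, the array shapes are one plausible layout, the instruction string is made up, and the "points stay still" predictor is a trivial placeholder rather than MolmoMotion or any baseline from the paper. The abstract does not name the benchmark's metrics, so the displacement error below is only a common choice for trajectory forecasting.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ForecastQuery:
    """Hypothetical container for one forecasting request."""
    history_frames: np.ndarray  # (T_hist, H, W, 3) short visual history
    query_points: np.ndarray    # (N, 3) query points in world coordinates
    instruction: str            # language description of the intended goal


def static_forecast(query: ForecastQuery, horizon: int) -> np.ndarray:
    """Placeholder predictor: every point stays where it is.

    Returns an array of shape (horizon, N, 3), one 3D position per
    future step and query point.
    """
    return np.repeat(query.query_points[None, :, :], horizon, axis=0)


def mean_displacement_error(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and true positions."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())


query = ForecastQuery(
    history_frames=np.zeros((4, 32, 32, 3), dtype=np.uint8),
    query_points=np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
    instruction="push the mug to the right",  # illustrative only
)

# Made-up ground truth: both points move 0.1 units along x per step.
horizon = 3
steps = np.arange(1, horizon + 1).reshape(-1, 1, 1)
target = query.query_points[None] + steps * np.array([0.1, 0.0, 0.0])

pred = static_forecast(query, horizon)
err = mean_displacement_error(pred, target)
```

The output layout (horizon, N, 3) gives one future 3D position per query point and time step, matching the per-point trajectory the task asks for. A real model would use the frames and instruction, which the placeholder ignores.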
Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.