MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

arXiv cs.CV·Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

6/18/2026

Quick Answer

MolmoMotion forecasts 3D point motion conditioned on language instructions and reports significant improvements over existing baselines on PointMotionBench.

Quick Take

The model, trained on 1.16M videos, accurately predicts diverse motion patterns and enhances robot manipulation training efficiency.

Key Points

MolmoMotion-1M corpus includes 1.16M videos with annotated 3D point trajectories.
PointMotionBench features 111 object categories and 61 motion types for benchmarking.
MolmoMotion model supports autoregressive prediction and flow-matching trajectory generation.
The model significantly outperforms existing motion prediction methods on PointMotionBench.
Learned 3D motion prior enhances training efficiency for robot manipulation tasks.

Paper Resources

Read Paper (arxiv.org) · View PDF (arxiv.org)

Source Excerpt

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object

Read the full article on arxiv.org


