EMMA: Extracting Multiple physical parameters from Multimodal Data

arXiv cs.CV·Farhat Shaikh, Ayan Banerjee, Sandeep Gupta

4d ago

·~2 min·5/26/2026·en·2

Quick Take

EMMA is a physics-informed multimodal framework that recovers the identifiable dynamical parameters of a system from raw video, audio, and image-based time-series data. Across 100+ scenarios, including five standard dynamical benchmarks, it outperforms existing single-modality and equation-discovery baselines. It uses a Liquid Time-Constant (LTC) network to learn latent dynamics and recovers multiple parameters without segmentation masks, differentiable rendering, or specialized sensors.

Key Points

EMMA performs joint inference of parameters and dynamics within a unified continuous-time model.
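It uses a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities, with a physics-constrained loss enforcing consistency with the governing differential equations.
It significantly outperforms existing single-modality and equation-discovery baselines.
It is evaluated on 100+ scenarios, including five standard dynamical benchmarks and real-world rover and quadrotor systems with hidden inputs.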
Code and data are publicly available for further research and application.
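
The Liquid Time-Constant network named in the key points can be sketched as a single state update. The snippet below is a minimal NumPy illustration of the fused (semi-implicit) Euler step from the original LTC formulation (Hasani et al.), not EMMA's implementation: the sigmoid gate, the layer sizes, the random weights, and the names `ltc_fused_step`, `W_rec`, `W_in`, `tau`, and `A` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ltc_fused_step(x, u, dt, W_rec, W_in, b, tau, A):
    """One fused Euler step of dx/dt = -(1/tau + f) * x + f * A.

    f is a bounded, input-dependent gate; a sigmoid is assumed here.
    """
    f = sigmoid(W_rec @ x + W_in @ u + b)
    return (x + dt * f * A) / (1.0 + dt * (1.0 / tau + f))

# Hypothetical layer: 4 hidden units driven by 3 input features.
rng = np.random.default_rng(0)
n_hidden, n_in = 4, 3
W_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_in = rng.normal(scale=0.5, size=(n_hidden, n_in))
b = np.zeros(n_hidden)
tau = np.ones(n_hidden)   # base time constants
A = np.ones(n_hidden)     # bias (reversal) term of the LTC equation

# Roll the state forward over a toy input sequence.
dt = 0.05
x = np.zeros(n_hidden)
for t in np.arange(0.0, 1.0, dt):
    u = np.array([np.sin(t), np.cos(t), 1.0])
    x = ltc_fused_step(x, u, dt, W_rec, W_in, b, tau, A)
```

Because the gate `f` appears in the denominator, each unit's effective time constant, `tau / (1 + tau * f)`, changes with the input, which is where the "liquid" in the name comes from.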

Article Content

From source RSS / original summary

arXiv:2605.24047v1 Announce Type: new Abstract: We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations.

Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations.
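
To make the idea of a physics-constrained loss concrete, here is a minimal sketch on a toy system. It assumes a damped oscillator, x'' + c x' + k x = 0, which is a hypothetical stand-in and not a system or loss taken from the paper. The loss is the mean squared residual of that equation along an observed trajectory, and because the residual is linear in `c` and `k`, the minimizer can be found with ordinary least squares. The function name `physics_residual_loss` and all numeric values are illustrative.

```python
import numpy as np

def physics_residual_loss(x, t, c, k):
    """Mean squared residual of x'' + c x' + k x = 0 along a sampled trajectory."""
    dx = np.gradient(x, t)
    ddx = np.gradient(dx, t)
    r = ddx + c * dx + k * x
    # Drop two samples at each end, where finite differences are least accurate.
    return float(np.mean(r[2:-2] ** 2))

# Synthetic observation: analytic solution for assumed true parameters.
c_true, k_true = 0.4, 4.0
t = np.linspace(0.0, 10.0, 2001)
omega = np.sqrt(k_true - (c_true / 2.0) ** 2)
x = np.exp(-c_true * t / 2.0) * np.cos(omega * t)

# The residual is linear in (c, k), so minimizing the loss is a least-squares fit
# of  c * x' + k * x = -x''.
dx = np.gradient(x, t)
ddx = np.gradient(dx, t)
M = np.column_stack([dx, x])[2:-2]
c_hat, k_hat = np.linalg.lstsq(M, -ddx[2:-2], rcond=None)[0]

loss_true = physics_residual_loss(x, t, c_true, k_true)
loss_wrong = physics_residual_loss(x, t, 1.0, 2.0)
```

In a learned model the trajectory would come from the network's latent state rather than an analytic formula, and the residual term would typically be added to a data-fitting loss and minimized by gradient descent; the closed-form solve here only works because this toy equation is linear in its parameters.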

A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors.
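
The abstract does not say how the feature pipeline aligns modalities, so the following is only a sketch of one common approach: resampling streams recorded at different rates onto a shared time grid by linear interpolation. The stream names, sampling rates, signals, and the helper `align_to_grid` are all hypothetical.

```python
import numpy as np

def align_to_grid(streams, grid):
    """Linearly interpolate each (timestamps, values) stream onto a shared time grid.

    Returns an array of shape (len(grid), number_of_streams).
    """
    return np.column_stack(
        [np.interp(grid, ts, vals) for ts, vals in streams.values()]
    )

# Hypothetical streams: 30 fps video track, 100 Hz audio feature,
# and a few irregularly spaced points read off a chart.
t_video = np.arange(0.0, 2.0, 1 / 30)
t_audio = np.arange(0.0, 2.0, 1 / 100)
t_chart = np.array([0.0, 0.3, 0.9, 1.4, 2.0])

streams = {
    "video_x": (t_video, np.sin(2 * np.pi * t_video)),
    "audio_rms": (t_audio, np.abs(np.cos(2 * np.pi * t_audio))),
    "chart_y": (t_chart, 2.0 * t_chart),
}

# Shared 50 Hz grid that stays inside the time span covered by every stream.
grid = np.linspace(0.0, 1.9, 96)
features = align_to_grid(streams, grid)
```

Each row of `features` then holds one time step with all modalities side by side, which is the kind of input a continuous-time model can consume directly.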

Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data.

Code and data are available at: https://github.com/ImpactLabASU/EMMA-CVPR2026

