Cosmos 3, from NVIDIA, is a family of omnimodal world models that integrates language, image, video, audio, and action processing within a single mixture-of-transformers architecture. Its post-trained models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis and as the best policy model by RoboArena at the time the technical report was written.
arXiv:2606.02800v1. Abstract: We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework.
Our evaluation shows that Cosmos 3 establishes a new state of the art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written.
To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License (https://openmdw.ai/license/1-1/) at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.
Plan2Map introduces a 208-case benchmark for reconstructing geospatial boundaries from UK planning documents. The GeoPlanAgent system achieves a mean IoU of 0.736, significantly outperforming baseline models, highlighting the challenges in localization and map registration.
Evi-Steer introduces a novel evidential tuning framework for BiomedCLIP, enabling efficient fine-tuning with only 0.11% parameter updates. It significantly enhances performance in few-shot learning and domain shifts across 15 biomedical imaging datasets, demonstrating robustness for clinical applications.
The DL-TMPFC framework automates the quantification of TIMI Myocardial Perfusion Frame Count, significantly improving CMVD diagnosis accuracy with a bias of -0.93 frames and a correlation of r=0.98 against manual measurements. Validated on 655 patients, it enhances clinical workflow by eliminating observer dependence and enabling rapid, objective assessments of microvascular dysfunction.