VLM3: Vision Language Models Are Native 3D Learners

arXiv cs.CV·Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu, Vikas Chandra, Yangyang Shi


6/1/2026

Quick Take

VLM3 is a deliberately simple method for teaching standard Vision Language Models (VLMs) 3D tasks. Its authors report that focal length unification, text-based pixel reference, and data mixture and scaling are sufficient for effective 3D learning. The method raises VLM depth estimation accuracy from 0.84 to 0.9 and enables further 3D tasks while keeping standard architectures.

Key Points

VLM3 raises VLM depth estimation accuracy from 0.84 to 0.9.
Relies on focal length unification, text-based pixel reference, and data mixture and scaling for 3D learning.
Enables diverse 3D tasks like camera pose estimation and pixel correspondence.
Keeps the standard VLM architecture, with no architecture changes, heavy augmentations, or complex losses.
Proposes a new paradigm for scalable 3D learning with standard VLMs.

Article Content

From source RSS / original summary

arXiv:2605.30561v1. Abstract: Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners.

Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions.
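The summary does not describe how the first two ingredients are implemented, so the following is only a minimal sketch under stated assumptions. It assumes that focal length unification means resizing each image so that all images share one focal length in pixels, and that a text-based pixel reference means writing a pixel location into the prompt as normalized integer coordinates. The names `Camera`, `unify_focal_length`, and `pixel_to_text`, the 1000-bin grid, and the prompt wording are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Camera:
    width: int       # image width in pixels
    height: int      # image height in pixels
    focal_px: float  # focal length in pixels


def unify_focal_length(cam: Camera, target_focal_px: float) -> Camera:
    """Return the camera after resizing the image to a shared focal length.

    Resizing an image by a factor s also scales its focal length in pixels
    by s, so the required factor is s = target / current.
    (Assumed interpretation of "focal length unification".)
    """
    if cam.focal_px <= 0 or target_focal_px <= 0:
        raise ValueError("focal lengths must be positive")
    scale = target_focal_px / cam.focal_px
    return Camera(
        width=max(1, round(cam.width * scale)),
        height=max(1, round(cam.height * scale)),
        focal_px=target_focal_px,
    )


def pixel_to_text(x: float, y: float, width: int, height: int, bins: int = 1000) -> str:
    """Encode a pixel location as plain text with normalized integer coordinates.

    (Hypothetical format; the paper's actual text encoding is not given here.)
    """
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pixel lies outside the image")
    col = int(x / width * bins)
    row = int(y / height * bins)
    return f"({col}, {row})"


# Example: bring a 1920x1080 image with a 1500 px focal length to 1000 px,
# then refer to the image center in a text prompt.
cam = unify_focal_length(Camera(width=1920, height=1080, focal_px=1500.0), target_focal_px=1000.0)
prompt = f"What is the depth in meters at pixel {pixel_to_text(640, 360, cam.width, cam.height)}?"
```

Under these assumptions the model only ever sees images at one focal length, and both the question and the answer remain ordinary text, which fits the abstract's claim that no architecture changes or regression formulation are needed.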

As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training.
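The abstract does not name the metric behind the 0.84 and 0.9 figures. One widely used depth accuracy measure is the threshold accuracy δ1: the fraction of pixels where max(pred/gt, gt/pred) is below 1.25. The sketch below computes that measure on toy values purely as an illustration. Treating it as the paper's metric is an assumption, and the function name `delta1_accuracy` and the sample depths are hypothetical.

```python
def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold.

    Pixels with non-positive ground truth are skipped as invalid;
    non-positive predictions on valid pixels count as misses.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Toy depths in meters (illustrative values, not from the paper).
pred_m = [1.0, 2.1, 3.0, 8.0]
gt_m = [1.1, 2.0, 4.0, 8.5]
score = delta1_accuracy(pred_m, gt_m)  # three of the four ratios fall below 1.25
```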

We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.
