Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models
Quick Take
Embodied3DBench introduces a benchmark for evaluating low-level spatial intelligence in Vision Language Models (VLMs) within 3D environments. It assesses 13 state-of-the-art models across 6 task categories, revealing strengths in high-level reasoning but weaknesses in interaction-oriented perception. A new dataset of 1.3M QA pairs is provided to enhance model training and performance.
Key Points
- Embodied3DBench evaluates VLMs on 6 task categories related to spatial intelligence.
- 13 state-of-the-art models were tested, showing gaps in interaction-oriented perception.
- The benchmark includes over 21,000 high-quality question-answer pairs.
- A large-scale dataset of 1.3M QA pairs was synthesized for model fine-tuning.
- Fine-tuning on this dataset significantly improves low-level spatial intelligence.
Article Content
From source RSS / original summary
arXiv:2605.29074v1. Abstract: Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments.
To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs.
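The two-group taxonomy described above can be written down as a small lookup table, which makes it easy to roll per-question results up to the group level. The following is a minimal sketch, not the benchmark's own tooling: the category and group names come from the abstract, while `group_accuracy`, the `(category, is_correct)` record format, and the use of plain accuracy are assumptions made for illustration (the benchmark may score tasks such as grasp point or trajectory prediction with other metrics).

```python
from collections import defaultdict

# Task taxonomy as stated in the abstract: two groups, three categories each.
TAXONOMY = {
    "Spatial Structural Understanding": [
        "Grounding",
        "Spatial Relation Prediction",
        "Multi-view Correspondence",
    ],
    "Interaction-Oriented Perception": [
        "Affordance Prediction",
        "Grasp Point Prediction",
        "Trajectory Prediction",
    ],
}

# Reverse lookup: category name -> group name.
CATEGORY_TO_GROUP = {
    category: group
    for group, categories in TAXONOMY.items()
    for category in categories
}


def group_accuracy(records):
    """Aggregate per-question correctness into per-group accuracy.

    `records` is a hypothetical format: an iterable of
    (category, is_correct) pairs, one per evaluated question.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        group = CATEGORY_TO_GROUP[category]
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}
```

Reporting results per group rather than as a single pooled score keeps the gap between structural understanding and interaction-oriented perception visible.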
We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs.
Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.
