DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

arXiv cs.CV·Hao Vo, Khoa Vo, Phu Loc Nguyen, Sieu Tran, Duc Minh Nguyen, Ngo Xuan Cuong, Gladys Gawugah, Sreevenkata Anjani Tishita Godavarthi, Chase Rainwater, Nghi D. Q. Bui, Anh Nguyen, Duy Minh Ho Nguyen, Ngan Le

5/25/2026 · ~2 min read · en

Quick Take

The DriveSpatial benchmark assesses spatiotemporal intelligence in VLMs for autonomous driving using 15.6K human-verified QA pairs across 20 tasks.

Key Points

Focuses on dynamic, multi-relational scene understanding.
Evaluates cognitive scene construction and temporal reasoning.
Reveals a 28.4-point gap between the strongest model and humans.

Article Content

From source RSS / original summary

arXiv:2605.23176v1. Announce type: new. Abstract: Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics.

However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets.

DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning.
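To make the idea of generating questions from a scene graph concrete, here is a minimal sketch. It is not the DriveSpatial pipeline, which the abstract does not describe in detail. The names `ObjectState`, `SceneGraph`, and `cross_view_qa`, the camera labels, and the question template are all hypothetical. The sketch stores per-frame object states with camera visibility, links states over time through a shared track ID, and emits a QA pair only when the two related objects share no camera, so the question cannot be answered from a single view.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectState:
    track_id: str        # stable ID shared across frames and cameras (hypothetical)
    category: str
    frame: int
    cameras: frozenset   # names of the cameras in which the object is visible


@dataclass
class SceneGraph:
    states: dict = field(default_factory=dict)     # (frame, track_id) -> ObjectState
    relations: list = field(default_factory=list)  # (frame, subject_id, predicate, object_id)

    def add_state(self, state):
        self.states[(state.frame, state.track_id)] = state

    def add_relation(self, frame, subject_id, predicate, object_id):
        self.relations.append((frame, subject_id, predicate, object_id))

    def track(self, track_id):
        """Temporal correspondence: all states of one object, ordered by frame."""
        matches = (s for s in self.states.values() if s.track_id == track_id)
        return sorted(matches, key=lambda s: s.frame)

    def cross_view_qa(self):
        """Return (question, answer) pairs for relations spanning disjoint views."""
        pairs = []
        for frame, subj_id, predicate, obj_id in self.relations:
            subj = self.states[(frame, subj_id)]
            obj = self.states[(frame, obj_id)]
            if subj.cameras & obj.cameras:
                continue  # both objects appear in one camera, so skip
            question = (
                f"At frame {frame}, where is the {subj.category} seen in "
                f"{', '.join(sorted(subj.cameras))} relative to the {obj.category} "
                f"seen in {', '.join(sorted(obj.cameras))}?"
            )
            pairs.append((question, predicate))
        return pairs


# Illustrative data only; camera names are placeholders.
graph = SceneGraph()
graph.add_state(ObjectState("car_1", "car", 0, frozenset({"CAM_FRONT"})))
graph.add_state(ObjectState("ped_1", "pedestrian", 0, frozenset({"CAM_BACK"})))
graph.add_state(ObjectState("car_1", "car", 1, frozenset({"CAM_FRONT", "CAM_FRONT_LEFT"})))
graph.add_state(ObjectState("ped_1", "pedestrian", 1, frozenset({"CAM_FRONT_LEFT"})))
graph.add_relation(0, "car_1", "in front of", "ped_1")
graph.add_relation(1, "car_1", "to the right of", "ped_1")

qa_pairs = graph.cross_view_qa()
```

In this toy graph only the frame-0 relation yields a question, because at frame 1 both objects are visible in the same camera. A real pipeline would also need geometry, interactions, and human verification, none of which are modeled here.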

Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence.

DriveSpatial and its construction pipeline will be released to support future research.

