4DP-QA: Scalable QA for 4D Perception in Vision Language Models
Quick Answer
The 4DP-QA pipeline improves Vision Language Models' understanding of 4D scenes by disentangling camera motion from object motion. It produces a 400K-sample training dataset (4DP-QA) and a 2.2K-sample benchmark (4DP-QA-Bench), and training existing models on the dataset improves their performance on an external benchmark.
Key Points
- Introduces True-Motion Tracking, which describes motion in a fixed reference system so that object motion is not confused with camera motion.
- Generates a large-scale dataset of 400K samples for training.
- Includes a benchmark of 2.2K samples for evaluation.
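- Training existing models on the dataset improves their performance on an external benchmark.
- Targets two obstacles: VLMs observe motion only indirectly, through its projection onto 2D images, and existing datasets do not disentangle object and camera motion.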
Article Excerpt
From source RSS / original summary
arXiv:2606.11568v1 Announce Type: new
Abstract: Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion.
To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench.
Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.
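The abstract does not say how True-Motion Tracking is computed. As a rough illustration of the underlying idea only, the sketch below contrasts a track expressed in moving camera coordinates with the same track re-expressed in a fixed frame. It assumes known camera-to-fixed-frame poses (a rotation R and translation t per frame). The function names, the pose format, and the toy scene are all hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]  # row-major rotation matrix


def to_fixed_frame(p_cam: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """Map a point from camera coordinates to the fixed frame: R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def displacement(track: Sequence[Vec3]) -> Vec3:
    """Net displacement between the first and last point of a track."""
    return tuple(track[-1][i] - track[0][i] for i in range(3))


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# Hypothetical scene (assumption): a static object 5 units ahead of the
# first camera, while the camera slides 1 unit to the right per frame.
cam_poses = [(IDENTITY, (float(k), 0.0, 0.0)) for k in range(3)]

# In camera coordinates the static object appears to drift left.
camera_frame_track = [(-1.0 * k, 0.0, 5.0) for k in range(3)]

# Re-express every observation in the fixed frame using the camera pose.
fixed_frame_track = [
    to_fixed_frame(p, R, t) for p, (R, t) in zip(camera_frame_track, cam_poses)
]

apparent_motion = displacement(camera_frame_track)  # mixes in camera motion
true_motion = displacement(fixed_frame_track)       # object motion only

print("camera-frame displacement:", apparent_motion)
print("fixed-frame displacement:", true_motion)
```

In this toy case the camera-frame track suggests the object moved two units to the left, while the fixed-frame track shows it did not move at all. That gap is the camera/object entanglement the abstract refers to.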