4DP-QA: Scalable QA for 4D Perception in Vision Language Models

arXiv cs.CV·Seokju Cho, Abhishek Badki, Hang Su, Jindong Jiang, Ziyao Zeng, Seungryong Kim, Sifei Liu, Orazio Gallo

6/11/2026

Quick Answer

The 4DP-QA pipeline generates motion-focused QA data that disentangles camera motion from object motion, yielding a 400K-sample training dataset (4DP-QA) and a 2.2K-sample benchmark (4DP-QA-Bench). Training existing Vision Language Models on the dataset improves their performance on an external benchmark.


Key Points

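Introduces True-Motion Tracking, which describes motion in a fixed reference system to separate object motion from camera motion.
Generates a large-scale training dataset of 400K samples, 4DP-QA.
Includes a 2.2K-sample benchmark, 4DP-QA-Bench, for evaluation.
Improves the performance of existing models on an external benchmark when they are trained on the dataset.
Targets two obstacles: VLMs observe motion only indirectly through its projection onto 2D images, and existing datasets do not disentangle object and camera motion.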


Article Excerpt

From source RSS / original summary

arXiv:2606.11568v1

Abstract: Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion.

To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench.

Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.
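The difference between tracking a point as the camera sees it and tracking it in a fixed reference system can be illustrated with a toy example. The sketch below is not the paper's pipeline: it assumes known camera-to-world poses (a rotation and a translation per frame), and all function names are hypothetical. It only shows why a static object appears to move in camera coordinates when the camera moves, and why that apparent motion vanishes in a fixed frame.

```python
# Illustrative sketch only: the names and the pose convention below are
# assumptions for this example, not the 4DP-QA implementation.

def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def to_fixed_frame(points_cam, poses):
    """Map per-frame camera-space points into one fixed (world) frame.

    points_cam[t] is the tracked point in frame t's camera coordinates.
    poses[t] = (R, t_vec) is an assumed camera-to-world pose for frame t,
    so that p_world = R * p_cam + t_vec.
    """
    track = []
    for p, (R, t_vec) in zip(points_cam, poses):
        rotated = mat_vec(R, p)
        track.append([rotated[i] + t_vec[i] for i in range(3)])
    return track

def displacement(track):
    """Net displacement between the first and last point of a track."""
    return [track[-1][i] - track[0][i] for i in range(3)]

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A static object sits at world position (0, 0, 5) while the camera
# slides 1 unit along +x per frame, without rotating.
poses = [(IDENTITY, [float(t), 0.0, 0.0]) for t in range(3)]

# In camera coordinates the static object appears to drift along -x.
points_cam = [[-float(t), 0.0, 5.0] for t in range(3)]

# Apparent motion in the moving camera's coordinates.
camera_frame_motion = displacement(points_cam)

# Motion after expressing the same track in the fixed frame.
fixed_frame_motion = displacement(to_fixed_frame(points_cam, poses))

print("camera-frame displacement:", camera_frame_motion)
print("fixed-frame displacement:", fixed_frame_motion)
```

In this toy setup the camera-frame displacement is nonzero even though the object never moves, while the fixed-frame displacement is zero. A question such as "did the object move?" has an unambiguous answer only in the fixed frame.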

