Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation
Quick Answer
The Physics Question Scene Graph (PQSG) is a hierarchical, question-based evaluation method for text-to-video generation that assesses whether generated videos are physically plausible.
Quick Take
The Physics Question Scene Graph (PQSG) introduces a novel evaluation method for text-to-video generation, assessing physical plausibility through a hierarchical question framework. Validated with the FinePhyEval dataset, PQSG shows higher correlation with human judgments than previous methods and ranks closed-source models higher in physical realism than Wan 2.1.
Key Points
- PQSG evaluates videos on their faithfulness to the prompt across objects, actions, and adherence to physical laws.
- FinePhyEval dataset includes physics-based prompts and human-annotated video evaluations.
- PQSG shows improved correlation with human judgments over prior evaluation methods.
- Under PQSG, closed-source models rank higher in physical realism than Wan 2.1.
- VLMs can generate human-like questions but fall short of human performance in answering them.
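As a rough illustration of what "correlation with human judgments" means in practice, the sketch below computes a Spearman rank correlation between automatic scores and human ratings. The summary does not say which correlation statistic the paper reports, so Spearman is an assumption here, and the helper names (`average_ranks`, `pearson`, `spearman`) and the numbers are toy values made up for the example, not results from FinePhyEval.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): one automatic score and one human rating per video.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 3, 1, 4]
rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the metric orders videos much as human annotators do; comparing this value across evaluation methods is one way a claim like "higher correlation than prior work" could be checked.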
Paper Resources
Abstract: Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in videos. We address this by introducing Physics Question Scene Graph (PQSG), a hierarchical question-based evaluation pipeline. PQSG evaluates generated videos by checking their faithfulness to a prompt across objects, actions, and adherence to physical laws using a graph-based hierarchy of questions generated by a vision-language model (VLM), guided by high-quality in-context examples. By representing questions as a graph, PQSG introduces logical dependencies within questions, ensuring that each query is contextually valid. Moreover, PQSG provides granular assessments of which qualities of the video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset with physics-based prompts and corresponding generated videos from diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated across multiple categories by humans. Using FinePhyEval, we measure the correlation between PQSG's fine-grained scores and human judgments, showing higher overall correlations than prior work. We also find that PQSG ranks closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the annotations we provide in FinePhyEval can also be used for subtask evaluation: we benchmark two strong VLMs on generating and answering questions, finding that while models can create human-like questions, they still fall short of human performance in answering them.
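The dependency idea in the abstract can be sketched as a small question graph in which a question is only treated as valid when all of its parent questions were answered "yes". This is a minimal sketch under stated assumptions, not the paper's implementation: the `Question` class, the `evaluate` and `category_scores` helpers, the example questions, and the choice to count skipped questions as failures are all hypothetical, and the answers are hard-coded where a real pipeline would query a VLM.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    qid: str
    text: str
    category: str  # "object", "action", or "physics"
    parents: list = field(default_factory=list)  # qids this question depends on


def evaluate(questions, answers):
    """Map each qid to True/False, or None if a parent question did not pass."""
    by_id = {q.qid: q for q in questions}
    results = {}

    def resolve(qid):
        if qid in results:
            return results[qid]
        q = by_id[qid]
        # A question is contextually valid only if every parent was answered yes.
        parents_ok = all(resolve(p) is True for p in q.parents)
        results[qid] = answers[qid] if parents_ok else None
        return results[qid]

    for q in questions:
        resolve(q.qid)
    return results


def category_scores(questions, results):
    """Fraction of questions passed per category (assumption: skipped = failed)."""
    totals, passed = {}, {}
    for q in questions:
        totals[q.category] = totals.get(q.category, 0) + 1
        passed[q.category] = passed.get(q.category, 0) + (results[q.qid] is True)
    return {c: passed[c] / totals[c] for c in totals}


# Hypothetical prompt: "A ball rolls down a ramp."
questions = [
    Question("q1", "Is there a ball?", "object"),
    Question("q2", "Does the ball roll down the ramp?", "action", ["q1"]),
    Question("q3", "Does the ball speed up as it rolls downhill?", "physics", ["q2"]),
]
# Stand-in for VLM answers about one generated video.
answers = {"q1": True, "q2": False, "q3": True}

results = evaluate(questions, answers)
scores = category_scores(questions, results)
print(results)
print(scores)
```

Here the physics question is skipped because the action it depends on never happens in the video, so its raw "yes" answer is not counted. The per-category scores are what make the assessment granular: they show which level (object, action, or physics) a video fails at.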
| Comments: | ECCV 2026. Code and data: this https URL |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2606.25306 [cs.CV] (or arXiv:2606.25306v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.25306 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Atin Pothiraj
[v1] Wed, 24 Jun 2026 02:12:54 UTC (4,553 KB)
— Originally published at arxiv.org