Video Models Can Reason with Verifiable Rewards

arXiv cs.CV·Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen

5/18/2026

Quick Answer

This paper introduces VideoRLVR, a recipe for optimizing video diffusion models with rule-based, verifiable rewards, and reports improved performance on Maze, FlowFree, and Sokoban.

Quick Take

VideoRLVR optimizes video diffusion models for verifiable reasoning using rule-based feedback, improving over supervised fine-tuning baselines on Maze, FlowFree, and Sokoban. Its Early-Step Focus strategy cuts training latency by about 40% while preserving performance, and the RL-optimized model outperforms the evaluated proprietary and open-source video generation models, pointing toward more reliable rule-consistent visual reasoning.

Key Points

VideoRLVR employs rule-based feedback to enhance video reasoning capabilities.
Training latency is reduced by approximately 40% using the Early-Step Focus strategy.
VideoRLVR outperforms the evaluated proprietary and open-source video generation models on these benchmarks and on out-of-domain benchmarks.
Dense decomposed rewards are crucial in low-success-rate scenarios.
VideoRLVR is evaluated on three procedurally generated domains (Maze, FlowFree, and Sokoban) with objective success criteria.
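
To illustrate why dense decomposed rewards can matter when full successes are rare, here is a minimal Python sketch of a rule-based maze check. It is not the paper's implementation: the function names, the four reward components, their weights, and the assumption that a grid path has already been extracted from the generated video are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row, col) on the maze grid

# Hypothetical weights; the paper's actual reward terms are not given here.
DEFAULT_WEIGHTS = {"start": 0.1, "no_wall": 0.3, "continuity": 0.3, "goal": 0.3}


def decomposed_maze_reward(
    path: List[Cell],
    walls: Set[Cell],
    start: Cell,
    goal: Cell,
    weights: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Score a path with several rule checks, each in [0, 1], plus a weighted total."""
    w = weights or DEFAULT_WEIGHTS
    if not path:
        parts = {name: 0.0 for name in w}
    else:
        moves = list(zip(path, path[1:]))
        # A move is valid when it goes to a 4-neighbour (Manhattan distance 1).
        adjacent = sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 for a, b in moves)
        parts = {
            "start": float(path[0] == start),
            "no_wall": sum(c not in walls for c in path) / len(path),
            "continuity": adjacent / len(moves) if moves else 1.0,
            "goal": float(path[-1] == goal),
        }
    parts["total"] = sum(w[name] * parts[name] for name in w)
    return parts


def sparse_success(parts: Dict[str, float]) -> float:
    """All-or-nothing reward: 1.0 only if every rule check is fully satisfied."""
    return float(all(v == 1.0 for k, v in parts.items() if k != "total"))


walls = {(1, 1)}
solved = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
near_miss = solved[:-1]  # stops one cell short of the goal
for name, path in [("solved", solved), ("near_miss", near_miss)]:
    parts = decomposed_maze_reward(path, walls, (0, 0), (2, 2))
    print(name, round(parts["total"], 3), sparse_success(parts))
```

In this toy setup the near-miss path still earns partial credit from the dense total, while the all-or-nothing check returns zero for it.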

[Submitted on 14 May 2026]

Abstract: Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
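
The abstract names an SDE-GRPO backbone and an Early-Step Focus strategy without giving formulas. As a rough sketch only: the first function below is the group-relative advantage commonly used in GRPO-style methods (each reward normalized by its group's mean and standard deviation), and the second is a hypothetical mask that keeps only the earliest denoising steps for the policy update. The helper names, the `early_fraction` parameter, and its value are assumptions for illustration, not details from the paper.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples generated for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def early_step_mask(num_steps: int, early_fraction: float) -> List[bool]:
    """True for denoising steps that take part in the policy update.

    Index 0 is assumed to be the first (noisiest) denoising step.
    `early_fraction` is a hypothetical knob, not a value from the paper.
    """
    cutoff = max(1, int(num_steps * early_fraction))
    return [t < cutoff for t in range(num_steps)]


# One group of four sampled videos for the same puzzle, scored by a rule-based verifier.
dense_rewards = [1.0, 0.7, 0.0, 0.3]
print(group_relative_advantages(dense_rewards))

# If every sample fails under an all-or-nothing reward, the advantages are all zero.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))

# Only the first half of a 10-step sampler would receive gradient in this sketch.
print(early_step_mask(num_steps=10, early_fraction=0.5))
```

When every sample in a group receives the same reward, the normalized advantages are all zero and the group contributes no learning signal. That is one plausible reading of why graded rewards help in low-success-rate settings, though the paper's own analysis may differ.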

Comments:	Website: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2605.15458 [cs.CV]
	(or arXiv:2605.15458v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.15458 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Tinghui Zhu
[v1] Thu, 14 May 2026 22:40:56 UTC (10,326 KB)

— Originally published at arxiv.org