Temporal Backtracking Search for Test-time Generative Video Reasoning
Quick Answer
The Temporal Backtracking Search (TBS) framework improves generative video reasoning by moving test-time search onto the temporal axis. In a strict out-of-distribution setting it reaches a 22.7% success rate, compared with 0.7% for Best-of-N sampling.
Quick Take
TBS combines variable-K conditioning, temporal process verification, and prefix-based search. Together these direct compute toward extending verified prefixes instead of resampling from scratch, which improves efficiency and exposes local reasoning ability that single-shot rollouts hide.
Key Points
- TBS transforms video generation into an iterative generate-verify-restart loop.
- Achieves 22.7% success in a strict out-of-distribution setting, versus 0.7% for Best-of-N.
- Utilizes variable-K conditioning to resume generation from any clean prefix.
- Employs temporal process verification to identify failures and valid restart points.
- Reallocates compute toward extending correct trajectories rather than resampling from the root.
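To see why restarting from a verified prefix can be so much cheaper than resampling from the root, consider a deliberately simplified model. It is an illustration, not the paper's analysis, and every symbol here is an assumption: a video consists of $T$ segments, each new segment is correct with probability $p$ independently given a correct prefix, and a perfect verifier locates the first failed segment.

```latex
\begin{align*}
  \Pr[\text{BoN succeeds with } N \text{ rollouts}] &= 1 - \left(1 - p^{T}\right)^{N}, \\
  \mathbb{E}[\text{rollouts until success, root resampling}] &= \frac{1}{p^{T}}, \\
  \mathbb{E}[\text{rollouts until success, prefix restarts}] &= 1 + \frac{T\,(1 - p)}{p}.
\end{align*}
% Example with p = 0.5 and T = 8:
%   root resampling: 1 / 0.5^8 = 256 rollouts on average
%   prefix restarts: 1 + 8 * 0.5 / 0.5 = 9 rollouts on average
```

The last line follows because each segment needs on average $(1-p)/p$ failed attempts before it succeeds, and every failure costs exactly one restart. Restarted rollouts are also shorter than full ones, so counting rollouts understates the saving in this toy model.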
Article Content
From the source RSS summary (arXiv:2606.13861v1).
Abstract: While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process.
Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis.
TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN.
In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, providing a scalable test-time framework to unlock it.
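The generate-verify-restart loop described in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_from_prefix` and `verify` are hypothetical stand-ins for a video model and a process verifier, a "video" is just a list of booleans marking whether each segment is correct, and the budget is counted in rollouts.

```python
import random


def generate_from_prefix(prefix, total_len, p_ok, rng):
    """Hypothetical stand-in for a video model: extend a clean prefix to
    total_len segments, each new segment correct with probability p_ok."""
    new_segments = [rng.random() < p_ok for _ in range(total_len - len(prefix))]
    return prefix + new_segments


def verify(rollout):
    """Hypothetical perfect process verifier: return the index of the first
    failed segment, or None if every segment is correct."""
    for i, ok in enumerate(rollout):
        if not ok:
            return i
    return None


def best_of_n(total_len, p_ok, budget, rng):
    """Root-level resampling: every rollout starts from an empty prefix."""
    for _ in range(budget):
        if verify(generate_from_prefix([], total_len, p_ok, rng)) is None:
            return True
    return False


def temporal_backtracking(total_len, p_ok, budget, rng):
    """Generate, verify, then restart from the longest verified prefix."""
    prefix = []
    for _ in range(budget):
        rollout = generate_from_prefix(prefix, total_len, p_ok, rng)
        first_bad = verify(rollout)
        if first_bad is None:
            return True
        # Keep everything before the first failure as the restart anchor.
        prefix = rollout[:first_bad]
    return False


def success_rate(search, trials=2000, total_len=8, p_ok=0.5, budget=16, seed=0):
    """Fraction of trials in which `search` finds a fully correct rollout."""
    rng = random.Random(seed)
    wins = sum(search(total_len, p_ok, budget, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print("best-of-n:   ", success_rate(best_of_n))
    print("backtracking:", success_rate(temporal_backtracking))
```

With the default toy settings, the backtracking search succeeds far more often than root resampling at the same rollout budget, because each failure only costs the segments after the verified prefix. A real system would differ in the places this sketch idealizes: the verifier is imperfect, and the model must support conditioning on a prefix of arbitrary length.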