Temporal Backtracking Search for Test-time Generative Video Reasoning

arXiv cs.CV·Sejoon Jun, Zheng Ding, Huangyuan Su, Weirui Ye, Yilun Du

6/15/2026

Quick Answer

Temporal Backtracking Search (TBS) improves generative video reasoning by moving test-time search onto the temporal axis. In a strict out-of-distribution setting it reaches a 22.7% success rate, compared with 0.7% for Best-of-N sampling.

Quick Take

TBS combines variable-K conditioning, temporal process verification, and prefix-based search, so that compute goes toward extending verified prefixes instead of resampling from the root. According to the authors, this exposes local reasoning ability in video models that single-shot rollouts understate.

Key Points

TBS transforms video generation into an iterative generate-verify-restart loop.
Achieves a 22.7% success rate in a strict out-of-distribution setting, versus 0.7% for Best-of-N.
Utilizes variable-K conditioning to resume generation from any clean prefix.
Employs temporal process verification to identify failures and valid restart points.
Reallocates compute toward extending correct trajectories rather than resampling from the root.

Article Content

From source RSS / original summary

arXiv:2606.13861v1. Abstract: While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process.

Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis.

TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN.
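The generate-verify-restart loop can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions, not the authors' implementation: a "video" is reduced to a list of frames marked valid (1) or flawed (0), each new frame is valid independently with probability `p_correct`, the verifier is perfect, and cost is counted in generated frames. All function names (`generate`, `first_failure`, `best_of_n`, `temporal_backtracking`, `success_rate`) are hypothetical.

```python
import random


def generate(prefix, length, p_correct, rng):
    """Toy stand-in for a video model: extend a clean prefix to full length.

    Frame value 1 means 'logically valid', 0 means 'flawed' (an assumption
    made for illustration; real frames are images, not bits).
    """
    frames = list(prefix)
    while len(frames) < length:
        frames.append(1 if rng.random() < p_correct else 0)
    return frames


def first_failure(frames):
    """Toy process verifier: index of the first flawed frame, or None."""
    for t, frame in enumerate(frames):
        if frame == 0:
            return t
    return None


def best_of_n(length, p_correct, frame_budget, rng):
    """Root-level resampling: every attempt restarts from an empty prefix."""
    spent = 0
    while spent + length <= frame_budget:
        frames = generate([], length, p_correct, rng)
        spent += length
        if first_failure(frames) is None:
            return True, spent
    return False, spent


def temporal_backtracking(length, p_correct, frame_budget, rng):
    """Generate, verify, then restart from the longest verified prefix."""
    prefix, spent = [], 0
    while spent + (length - len(prefix)) <= frame_budget:
        frames = generate(prefix, length, p_correct, rng)
        spent += length - len(prefix)  # only newly generated frames cost compute
        t = first_failure(frames)
        if t is None:
            return True, spent
        prefix = frames[:t]  # keep verified progress, discard the flawed tail
    return False, spent


def success_rate(search, trials=2000, length=20, p_correct=0.9,
                 frame_budget=60, seed=0):
    """Fraction of trials solved by `search` under a fixed frame budget."""
    rng = random.Random(seed)
    wins = sum(search(length, p_correct, frame_budget, rng)[0]
               for _ in range(trials))
    return wins / trials
```

Calling `success_rate(best_of_n)` and `success_rate(temporal_backtracking)` compares the two strategies at the same frame budget. In this toy model the backtracking variant wins because a failed rollout still contributes its verified prefix, whereas root resampling pays for the whole video again. The numbers it produces say nothing about the paper's reported results.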

In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, providing a scalable test-time framework to unlock it.

