While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts, because spatial trajectories commit early in the diffusion process. Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early along the temporal axis, and resampling from the root blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis. TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than resampling from the root. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, and it provides a scalable test-time framework to unlock that competence.
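The generate-verify-restart loop described above can be illustrated with a minimal toy sketch. This is not the authors' implementation: frames are plain integers, `make_toy_generator` and `first_failure` are hypothetical stand-ins for a prefix-conditioned video model and a temporal process verifier, and counting the budget in generated frames is an assumption made only so that the two strategies can be compared on equal terms.

```python
import random
from typing import Callable, List, Optional

Video = List[int]  # toy stand-in: one integer per frame
Generator = Callable[[Video, int], Video]
Verifier = Callable[[Video], Optional[int]]


def first_failure(video: Video) -> Optional[int]:
    """Toy process verifier (assumption): frame t is valid iff it equals t.

    Returns the index of the first invalid frame, or None if all are valid.
    """
    for t, frame in enumerate(video):
        if frame != t:
            return t
    return None


def make_toy_generator(p_correct: float, rng: random.Random) -> Generator:
    """Hypothetical generator: keeps the clean prefix untouched and samples
    each remaining frame, which is correct with probability p_correct."""
    def generate(prefix: Video, horizon: int) -> Video:
        video = list(prefix)
        for t in range(len(prefix), horizon):
            video.append(t if rng.random() < p_correct else -1)
        return video
    return generate


def best_of_n(generate: Generator, verify: Verifier,
              horizon: int, budget: int) -> Optional[Video]:
    """Root-level baseline: every attempt restarts from an empty prefix."""
    used = 0
    while used + horizon <= budget:
        video = generate([], horizon)
        used += horizon
        if verify(video) is None:
            return video
    return None


def temporal_backtracking_search(generate: Generator, verify: Verifier,
                                 horizon: int, budget: int) -> Optional[Video]:
    """Generate, locate the first failure, restart from the verified prefix."""
    prefix: Video = []
    used = 0
    while used + (horizon - len(prefix)) <= budget:
        video = generate(prefix, horizon)
        used += horizon - len(prefix)  # only newly generated frames are charged
        fail = verify(video)
        if fail is None:
            return video
        prefix = video[:fail]  # frames before the failure become the restart anchor
    return None


if __name__ == "__main__":
    horizon, budget = 20, 400
    bon = best_of_n(make_toy_generator(0.8, random.Random(0)),
                    first_failure, horizon, budget)
    tbs = temporal_backtracking_search(make_toy_generator(0.8, random.Random(0)),
                                       first_failure, horizon, budget)
    print("BoN solved:", bon is not None)
    print("TBS solved:", tbs is not None)
```

In this toy setting a root rollout succeeds only if all frames are sampled correctly in one pass, whereas the backtracking loop keeps every verified frame and spends its budget only on the unresolved suffix. The sketch keeps a single restart anchor for brevity; it does not attempt to model branching over multiple prefixes or the variable-K conditioning mechanism itself.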