Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, in which reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead emerges primarily along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. Within a single diffusion step, we further uncover self-evolved functional specialization inside Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof of concept, demonstrating how reasoning can be improved by ensembling latent trajectories from the same model run with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation for future research to better exploit the inherent reasoning dynamics of video models as a new substrate for intelligence.
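The seed-ensembling idea can be sketched in miniature. The snippet below is a toy illustration, not the paper's actual pipeline: `denoise_step` is a hypothetical stand-in for a real video-diffusion update (a DiT noise prediction), replaced here by a deterministic pull toward a fixed "answer" latent plus seed-dependent noise. The sketch shows the core mechanism only: several denoising trajectories are run from different random seeds and their latents are periodically averaged, so seed-specific errors cancel while the shared solution survives. All names (`denoise_step`, `ensemble_sample`, `mix_every`) are invented for this example.

```python
import numpy as np

NUM_STEPS = 50          # number of denoising steps in the toy sampler
SHAPE = (4, 8, 8)       # (channels, height, width) of a toy latent

def denoise_step(latent, step, rng):
    """Hypothetical stand-in for one diffusion denoising step.

    A real video model would predict noise with a Diffusion Transformer;
    here we simply nudge the latent toward a fixed target ("the answer")
    and inject seed-dependent noise that decays as denoising progresses.
    """
    target = np.ones(SHAPE)
    noise_scale = 0.1 * (1.0 - step / NUM_STEPS)  # exploration shrinks over steps
    return (latent
            + 0.2 * (target - latent)
            + rng.normal(scale=noise_scale, size=SHAPE))

def ensemble_sample(seeds, mix_every=10):
    """Run one trajectory per seed; periodically average the latents.

    Averaging identical-model trajectories keeps the consensus solution
    and cancels seed-specific deviations (the training-free strategy in
    spirit, reduced to a toy).
    """
    rngs = [np.random.default_rng(s) for s in seeds]
    latents = [rng.standard_normal(SHAPE) for rng in rngs]  # independent inits
    for step in range(NUM_STEPS):
        latents = [denoise_step(z, step, rng) for z, rng in zip(latents, rngs)]
        if (step + 1) % mix_every == 0:
            mean = np.mean(latents, axis=0)          # consensus latent
            latents = [mean.copy() for _ in latents]  # reset all trajectories
    return np.mean(latents, axis=0)
```

With a single seed the function degenerates to an ordinary sampling run, so the ensemble variant can be compared against it directly; in this toy, both converge near the target, with the ensemble's residual noise reduced by averaging.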