Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1--2 sampling steps. In this regime, we identify the initialization of the few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose \textbf{Causal Forcing++}, a principled and scalable pipeline that uses \emph{causal consistency distillation} (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains its supervision from a single online teacher ODE step between adjacent timesteps, which avoids precomputing and storing full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, \ours, surpasses the state-of-the-art 4-step chunk-wise Causal Forcing in the \textit{\textbf{frame-wise 2-step setting}} by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50\% and Stage 2 training cost by ${\sim}4\times$. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project pages: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM
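
The contrast between the two initialization objectives can be sketched as follows. This is an illustrative sketch only: the notation is assumed for exposition and is not taken from the formal definitions of the method. Let $x^{i}_{t}$ denote frame $i$ noised to time $t$, $x^{<i}$ the preceding clean frames, $f_\theta$ the student's AR-conditional flow map to $t=0$, $f_{\theta^-}$ a stop-gradient (e.g., EMA) copy of the student, $v_\phi$ the teacher velocity field (assumed here to accept the same AR conditioning), and $d(\cdot,\cdot)$ a distance such as squared $\ell_2$. Causal ODE distillation regresses onto the endpoint of a stored teacher trajectory,
\begin{equation}
\mathcal{L}_{\mathrm{ODE}}(\theta) = \mathbb{E}\left[ \left\| f_\theta\big(x^{i}_{t}, t \mid x^{<i}\big) - x^{i}_{0} \right\|_2^2 \right],
\end{equation}
where $(x^{i}_{t}, x^{i}_{0})$ lie on the same precomputed teacher PF-ODE trajectory. Causal CD instead takes a single online teacher step between adjacent timesteps $t_{n+1} > t_n$ (written here as an Euler step, which is one possible solver choice),
\begin{equation}
\hat{x}^{i}_{t_n} = x^{i}_{t_{n+1}} + (t_n - t_{n+1})\, v_\phi\big(x^{i}_{t_{n+1}}, t_{n+1} \mid x^{<i}\big),
\end{equation}
and enforces self-consistency of the student along that step,
\begin{equation}
\mathcal{L}_{\mathrm{CD}}(\theta) = \mathbb{E}\left[ d\Big( f_\theta\big(x^{i}_{t_{n+1}}, t_{n+1} \mid x^{<i}\big),\; f_{\theta^-}\big(\hat{x}^{i}_{t_n}, t_n \mid x^{<i}\big) \Big) \right],
\end{equation}
with the boundary condition $f_\theta(x, 0 \mid \cdot) = x$. Under these assumptions, both objectives target the same map from $x^{i}_{t}$ to the PF-ODE endpoint, but the second requires only one teacher evaluation per training sample and no stored trajectories.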