To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, which creates an architectural gap when full attention is replaced by causal attention. Existing approaches, however, do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity: each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, so the student cannot recover the teacher's flow map and instead converges to a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an AR teacher for ODE initialization and thereby bridges the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3\% in Dynamic Degree, 8.7\% in VisionReward, and 16.7\% in Instruction Following. Project page and code: \href{https://thu-ml.github.io/CausalForcing.github.io/}{https://thu-ml.github.io/CausalForcing.github.io/}
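The conditional-expectation claim can be sketched as follows; the notation here is illustrative and not taken from the paper's formal setup. ODE distillation trains a student $f_\theta$ to regress the teacher's PF-ODE endpoint from a noisy frame:
\[
\min_\theta \; \mathbb{E}_{(x_t, x_0)} \left\| f_\theta(x_t) - x_0 \right\|^2 .
\]
If the teacher's flow map $x_t \mapsto x_0$ is injective at the frame level, this regression has the flow map itself as its minimizer. If instead several distinct clean frames $x_0^{(1)} \neq x_0^{(2)}$ are paired with the same noisy frame $x_t$ (as happens when a bidirectional teacher conditions each frame on future context unavailable to the causal student), the $L^2$ minimizer is the average
\[
f^\ast(x_t) = \mathbb{E}\left[ x_0 \mid x_t \right],
\]
a blend of the admissible targets rather than any single one, which is the degraded conditional-expectation solution described above.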