Causality, the temporal, unidirectional cause-effect relationship between components, underlies many complex generative processes, including video, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples the two: a causal transformer encoder performs temporal reasoning once per frame, while a lightweight diffusion decoder carries out multi-step frame-wise rendering. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
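To make the encoder/decoder split concrete, the following is a minimal sketch in a PyTorch style. The module names (CausalEncoder, FrameDecoder), the one-token-per-frame representation, the layer sizes, and the toy Euler-style denoising update are all illustrative assumptions, not the paper's implementation; the sketch only mirrors the stated structure, where the causal pass runs once per new frame and only the small decoder runs inside the K-step loop.

import torch
import torch.nn as nn

class CausalEncoder(nn.Module):
    """Runs ONCE per frame: causal self-attention over past frame tokens."""
    def __init__(self, dim=256, heads=4, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, frame_tokens):  # (B, T, D); one token per frame here
        T = frame_tokens.size(1)
        # Boolean mask: True = position may NOT attend (future frames blocked).
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        return self.encoder(frame_tokens, mask=causal_mask)

class FrameDecoder(nn.Module):
    """Lightweight per-frame denoiser, run for K steps, conditioned on context."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, dim * 2), nn.GELU(),
            nn.Linear(dim * 2, dim))

    def forward(self, x_t, context, t):  # x_t, context: (B, D); t: float in (0, 1]
        t_emb = torch.full_like(x_t[:, :1], t)  # scalar noise-level embedding
        return self.net(torch.cat([x_t, context, t_emb], dim=-1))

@torch.no_grad()
def generate_frame(encoder, decoder, past_frames, num_steps=8):
    """Temporal reasoning once, then a short intra-frame denoising loop."""
    context = encoder(past_frames)[:, -1]           # one causal pass per new frame
    x = torch.randn(past_frames.size(0), past_frames.size(2))
    for k in reversed(range(num_steps)):            # multi-step, frame-local rendering
        t = (k + 1) / num_steps
        eps = decoder(x, context, t)                # predicted noise
        x = x - eps / num_steps                     # toy Euler-style update
    return x

# Usage: encode 7 past frames once, then render the 8th in 8 cheap decoder steps.
enc, dec = CausalEncoder(), FrameDecoder()
past = torch.randn(2, 7, 256)
new_frame = generate_frame(enc, dec, past)
print(new_frame.shape)  # torch.Size([2, 256])

Under this split, the expensive cross-frame attention is amortized to a single call per frame, so the per-frame latency is dominated by num_steps passes through the small decoder rather than the full network.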