Autoregressive video diffusion models have recently gained considerable research interest due to their causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% of the heads attend almost exclusively to the current frame, and discarding their KV caches incurs only minor performance degradation. Building upon this, we propose Dummy Forcing, a simple yet effective method to control context accessibility across different heads. Specifically, the proposed heterogeneous memory allocation reduces head-wise context redundancy, accompanied by dynamic head programming to adaptively classify head types. Moreover, we develop a context packing technique to achieve more aggressive cache compression. Without additional training, our Dummy Forcing delivers up to 2.0x speedup over the baseline, supporting video generation at 24.3 FPS with less than 0.5% quality drop. Project page is available at https://csguoh.github.io/project/DummyForcing/.
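The observation above — that some heads place nearly all of their attention mass on the current frame and can safely have their historical KV caches dropped — can be sketched as a simple classification rule. The snippet below is a minimal illustration, not the paper's actual algorithm: the function name, the threshold value, and the key-index convention are all assumptions for demonstration.

```python
import numpy as np

def classify_current_frame_heads(attn, current_start, thresh=0.9):
    """Flag heads whose attention mass falls almost entirely on the
    current frame's keys (hypothetical criterion, not the paper's).

    attn: [num_heads, num_queries, num_keys] softmax attention weights
    current_start: key index where the current frame's tokens begin
    Returns a boolean mask; True means the head's historical KV cache
    is a candidate for dropping.
    """
    # Fraction of each head's attention on current-frame keys,
    # averaged over all query positions.
    current_mass = attn[:, :, current_start:].sum(-1).mean(-1)
    return current_mass >= thresh

# Toy example: 4 heads, 2 queries, 8 keys (last 2 keys = current frame).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2, 8))
logits[0, :, -2:] += 10.0  # head 0 concentrates on the current frame
logits[1, :, :-2] += 10.0  # head 1 concentrates on history
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
mask = classify_current_frame_heads(attn, current_start=6)
```

In this toy setup, head 0 would be flagged (its mass on current-frame keys is near 1.0) while head 1 would not, mirroring the paper's finding that only a subset of heads are current-frame-only and thus compressible.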