Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies: generating a single frame requires the model to process the entire sequence, including future frames. We address this limitation by adapting a pretrained bidirectional diffusion transformer into a causal transformer that generates frames on the fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling a 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on the teacher's ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, enabling long-duration video synthesis despite training on short clips. Thanks to KV caching, our model supports fast streaming generation of high-quality videos at 9.4 FPS on a single GPU. Our approach also enables streaming video-to-video translation, image-to-video generation, and dynamic prompting in a zero-shot manner. We will release code based on an open-source model in the future.