Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies: generating a single frame requires the model to process the entire sequence, including future frames. We address this limitation by adapting a pretrained bidirectional diffusion transformer into a causal transformer that generates frames on the fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling a 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on the teacher's ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, enabling long-duration video synthesis despite training on short clips. Thanks to KV caching, our model supports fast streaming generation of high-quality videos at 9.4 FPS on a single GPU. Our approach also enables streaming video-to-video translation, image-to-video generation, and dynamic prompting in a zero-shot manner. We will release code based on an open-source model in the future.