Recently, advancements in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models in creating dynamic visual content. The emergence of SORA has further spotlighted the potential of video generation technologies. Nonetheless, extending video length remains constrained by limited computational resources, and most existing video synthesis models can only generate short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design dedicated extension strategies for each of the common temporal model architectures, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we conduct extension training on the Stable Video Diffusion model. Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length does not compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.
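One of the three extension strategies named above concerns positional embeddings. As a minimal illustrative sketch (not the paper's actual implementation, whose details are not given here), a learned temporal positional embedding can be stretched to a longer frame count by linear interpolation along the time axis; the tensor shapes and frame counts below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def extend_positional_embedding(pos_emb: torch.Tensor, target_frames: int) -> torch.Tensor:
    """Stretch a learned temporal positional embedding of shape
    (num_frames, dim) to (target_frames, dim) by linear interpolation
    along the time axis. A hypothetical helper, not ExVideo's exact code."""
    # F.interpolate expects (batch, channels, length), so move the
    # embedding dimension into the channel axis first.
    emb = pos_emb.t().unsqueeze(0)                  # (1, dim, num_frames)
    emb = F.interpolate(emb, size=target_frames,
                        mode="linear", align_corners=True)
    return emb.squeeze(0).t()                       # (target_frames, dim)

# Hypothetical sizes: a 25-frame embedding extended 5x to 125 frames.
short = torch.randn(25, 320)
long_emb = extend_positional_embedding(short, 125)
print(long_emb.shape)  # torch.Size([125, 320])
```

With `align_corners=True`, the first and last frame embeddings are preserved exactly, so the extended sequence anchors to the original temporal endpoints.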