Recently, advancements in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models in creating dynamic visual content. The emergence of SORA has further spotlighted the potential of video generation technologies. Nonetheless, extending video length remains constrained by limited computational resources, and most existing video synthesis models can only generate short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design dedicated extension strategies for each of the common temporal model architectures, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we conduct extension training on the Stable Video Diffusion model. Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length does not compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.
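One of the three extension strategies named above concerns positional embeddings. As a minimal illustrative sketch (not the paper's actual implementation, whose details are not given here), a learned temporal positional embedding can be stretched to a longer frame count by linear interpolation along the time axis; the tensor shapes and frame counts below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def extend_positional_embedding(pos_emb: torch.Tensor, target_frames: int) -> torch.Tensor:
    """Stretch a learned temporal positional embedding of shape
    (num_frames, dim) to (target_frames, dim) by linear interpolation
    along the time axis. A hypothetical helper, not ExVideo's exact code."""
    # F.interpolate expects (batch, channels, length), so move the
    # embedding dimension into the channel axis first.
    emb = pos_emb.t().unsqueeze(0)                  # (1, dim, num_frames)
    emb = F.interpolate(emb, size=target_frames,
                        mode="linear", align_corners=True)
    return emb.squeeze(0).t()                       # (target_frames, dim)

# Hypothetical sizes: a 25-frame embedding extended 5x to 125 frames.
short = torch.randn(25, 320)
long_emb = extend_positional_embedding(short, 125)
print(long_emb.shape)  # torch.Size([125, 320])
```

With `align_corners=True`, the first and last frame embeddings are preserved exactly, so the extended sequence anchors to the original temporal endpoints.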