MagicVideo: Efficient Video Generation With Latent Diffusion Models

We present MagicVideo, an efficient text-to-video generation framework based on latent diffusion models. MagicVideo generates smooth video clips that are consistent with the given text descriptions. Thanks to a novel and efficient 3D U-Net design and to modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips at 256x256 spatial resolution on a single GPU card, requiring roughly 64x fewer FLOPs than Video Diffusion Models (VDM). Specifically, unlike existing works that train video models directly in RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of the videos' latent codes with a diffusion model. In addition, we introduce two new designs to adapt a U-Net denoiser trained on image tasks to video data: a lightweight frame-wise adaptor for the image-to-video distribution adjustment, and a directed temporal attention module that captures temporal dependencies across frames. This lets us exploit the informative convolution weights of a text-to-image model to accelerate video training. To reduce pixel dithering in the generated videos, we also propose a novel VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content. See \url{https://magicvideo.github.io/#} for more examples.
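The abstract does not spell out how the directed temporal attention module works, so the following is only a minimal sketch of one plausible reading: attention across frames with a causal mask, so that frame t attends only to frames 0..t. The function name `directed_temporal_attention`, the identity query/key/value projections, and the use of one plain feature vector per frame are all assumptions made for illustration, not the authors' implementation.

```python
import math


def directed_temporal_attention(frames):
    """Hypothetical sketch of causally masked attention over frames.

    frames: list of T feature vectors (lists of floats), one per frame.
    Returns (outputs, weights), where outputs[t] is a softmax-weighted
    average of frames 0..t and weights[t][s] is the attention from
    frame t to frame s (zero for every s > t).

    Assumption: queries, keys and values are the raw features; a real
    module would apply learned projections first.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for t, query in enumerate(frames):
        # Directed mask: only frames at or before t are scored.
        scores = [
            scale * sum(q * k for q, k in zip(query, frames[s]))
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (num_frames - t - 1)
        # Weighted average of the visible frames' features.
        out = [
            sum(row[s] * frames[s][d] for s in range(num_frames))
            for d in range(dim)
        ]
        outputs.append(out)
        weights.append(row)
    return outputs, weights


if __name__ == "__main__":
    toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    outputs, weights = directed_temporal_attention(toy_frames)
    for t, row in enumerate(weights):
        print(t, [round(w, 3) for w in row])
```

In this sketch the first frame can only attend to itself, so its output equals its input, while later frames mix in information from earlier ones. A full-model version would operate on the VAE latent codes at each spatial position rather than on toy vectors.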

