Inspired by the remarkable success of Latent Diffusion Models (LDMs) for image synthesis, we study LDMs for text-to-video generation, a formidable challenge because of the computational and memory constraints during both model training and inference. A single LDM can usually generate only a very limited number of video frames. Some existing works use separate prediction models to generate more frames, but these suffer from additional training cost and frame-level jittering. In this paper, we propose a framework called "Reuse and Diffuse", dubbed $\textit{VidRD}$, that produces more frames following those already generated by an LDM. Conditioned on an initial video clip with a small number of frames, additional frames are generated iteratively by reusing the original latent features and following the previous diffusion process. In addition, for the autoencoder that translates between pixel space and latent space, we inject temporal layers into its decoder and fine-tune these layers for higher temporal consistency. We also propose a set of strategies for composing video-text data with diverse content from multiple existing datasets, including video datasets for action recognition and image-text datasets. Extensive experiments show that our method achieves good results in both quantitative and qualitative evaluations. Our project page is available $\href{https://anonymous0x233.github.io/ReuseAndDiffuse/}{here}$.
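The iterative scheme described in the abstract (generate a short clip, then repeatedly produce follow-up frames conditioned on the most recent ones while reusing the original latent noise) can be pictured with a toy sketch. This is not the authors' implementation: `toy_denoise_step`, `generate_clip`, `extend_video`, and all parameters are hypothetical names, each frame is reduced to a single scalar latent, and the "denoiser" is a simple stand-in rather than a real LDM.

```python
import random


def toy_denoise_step(latents):
    """Hypothetical stand-in for one LDM denoising step (not a real model).

    Shrinks each frame's scalar latent and mixes in the clip mean, so that
    frames influence one another within a clip.
    """
    mean = sum(latents) / len(latents)
    return [0.9 * z + 0.05 * mean for z in latents]


def generate_clip(noise, num_steps, prompt=None):
    """Denoise one clip starting from `noise`.

    If `prompt` is given, the leading frames are reset to those
    already-generated latents after every step, so the remaining frames are
    denoised conditioned on them. This conditioning rule is an assumption.
    """
    latents = list(noise)
    for _ in range(num_steps):
        latents = toy_denoise_step(latents)
        if prompt is not None:
            latents[:len(prompt)] = prompt
    return latents


def extend_video(num_frames=8, num_prompt=2, num_rounds=3, num_steps=10, seed=0):
    """Generate an initial clip, then extend it for `num_rounds` rounds.

    Assumes 0 < num_prompt < num_frames.
    """
    rng = random.Random(seed)
    # Initial noise is drawn once and reused in every round (assumption:
    # this is one simple reading of "reusing the original latent features").
    base_noise = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]

    clip = generate_clip(base_noise, num_steps)
    video = list(clip)
    for _ in range(num_rounds):
        prompt = clip[-num_prompt:]  # most recent frames act as the condition
        clip = generate_clip(base_noise, num_steps, prompt=prompt)
        video.extend(clip[num_prompt:])  # keep only the newly generated frames
    return video


video = extend_video(num_frames=8, num_prompt=2, num_rounds=3)
print(len(video))
```

Each round contributes `num_frames - num_prompt` new frames, so the total length is `num_frames + num_rounds * (num_frames - num_prompt)`. No separate prediction model is involved in the sketch; the same denoising routine is called in every round.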