AI-generated content has attracted considerable attention recently, but photo-realistic video synthesis remains challenging. Although many approaches based on GANs and autoregressive models have been proposed, the visual quality and length of the generated videos are still far from satisfactory. Diffusion models have recently shown remarkable results but require significant computational resources. To address this, we introduce lightweight video diffusion models that operate in a low-dimensional 3D latent space and significantly outperform previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space so that longer videos, with more than one thousand frames, can be produced. To further overcome the performance degradation in long video generation, we propose conditional latent perturbation and unconditional guidance, which effectively mitigate the errors accumulated as the video length is extended. Extensive experiments on small-domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.