We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality and temporally coherent videos. We achieve this by integrating two diffusion models: a Latent Image Diffusion Model (LIDM) trained on images and a Latent Video Diffusion Model (LVDM) trained on video data. Our method combines these models in the reverse diffusion process, where the LIDM enhances image quality and the LVDM ensures temporal consistency. This unique combination allows us to effectively handle the complex spatio-temporal dynamics in video generation. Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
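The idea of combining two denoisers inside one reverse-diffusion loop can be sketched minimally. The snippet below is an illustrative toy, not the paper's implementation: `lidm_denoise` and `lvdm_denoise` are hypothetical stand-ins for the two models' per-step denoising functions, and the simple alternating schedule is just one plausible way to interleave them.

```python
def jvid_sample(lidm_denoise, lvdm_denoise, frame_latents, num_steps=10):
    """Toy sketch of a joint reverse-diffusion loop.

    At each timestep one of the two denoisers is applied to every
    frame latent: the video model (temporal coherence) on even steps,
    the image model (per-frame quality) on odd steps. A real sampler
    would also involve the noise schedule and sampler update rule.
    """
    for t in reversed(range(num_steps)):
        denoise = lvdm_denoise if t % 2 == 0 else lidm_denoise
        frame_latents = [denoise(z, t) for z in frame_latents]
    return frame_latents


# Dummy "denoisers" standing in for neural networks, for illustration only.
lidm = lambda z, t: z * 0.5   # hypothetical image-model step
lvdm = lambda z, t: z - 0.1   # hypothetical video-model step

frames = jvid_sample(lidm, lvdm, [1.0, 2.0], num_steps=2)
```

With `num_steps=2`, each frame latent passes through the image step once (t=1) and the video step once (t=0); the design choice being illustrated is simply that both models operate on the same shared latents rather than on separate pipelines.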