This study introduces MeDM, an efficient and effective method that uses pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a G-buffer of surface normals, or perform text-guided editing on videos captured in real-world scenarios. We use explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. With this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observation-space scores in latent Diffusion Models. Notably, MeDM requires neither fine-tuning nor test-time optimization of the Diffusion Models. Through extensive qualitative, quantitative, and subjective experiments on various benchmarks, we demonstrate the effectiveness and superiority of the proposed approach. Our project page can be found at https://medm2023.github.io
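To illustrate how temporal consistency can reduce to an optimization problem with a closed-form solution, the following is a minimal sketch under stated assumptions, not the MeDM implementation. It assumes that optical flow has already been converted into a hypothetical `group_ids` array that assigns the same integer to pixels depicting the same scene point across frames. Under that assumption, the least-squares problem of finding the video closest to the independent per-frame predictions, subject to all pixels in a group sharing one value, is solved by the per-group mean. The function name `mediate` and both argument names are illustrative only; how the paper builds its coding from optical flow and handles occlusions is not specified in the abstract.

```python
import numpy as np


def mediate(frames: np.ndarray, group_ids: np.ndarray) -> np.ndarray:
    """Replace every pixel with the mean of its correspondence group.

    frames:    float array of shape (T, H, W, C) holding independent
               per-frame predictions.
    group_ids: int array of shape (T, H, W); pixels with equal ids are
               ASSUMED to show the same scene point (hypothetical input,
               e.g. derived from optical flow).
    """
    channels = frames.shape[-1]
    flat = frames.reshape(-1, channels)
    ids = group_ids.reshape(-1)
    n_groups = int(ids.max()) + 1

    # Accumulate per-group sums; np.add.at handles repeated indices correctly.
    sums = np.zeros((n_groups, channels), dtype=flat.dtype)
    np.add.at(sums, ids, flat)

    # Count group members; guard against unused ids to avoid division by zero.
    counts = np.bincount(ids, minlength=n_groups).astype(flat.dtype)
    means = sums / np.maximum(counts, 1)[:, None]

    # Broadcast each group mean back to every pixel in that group.
    return means[ids].reshape(frames.shape)


# Toy example: two frames of size 1x2 with one channel.
# The left pixel of both frames shares group 0; the other pixels are unmatched.
frames = np.array([[[[1.0], [2.0]]], [[[3.0], [10.0]]]])
group_ids = np.array([[[0, 1]], [[0, 2]]])
consistent = mediate(frames, group_ids)
```

In the toy example, the two pixels in group 0 (values 1 and 3) are both replaced by their mean, while the unmatched pixels keep their original values. Pixels without a correspondence simply form groups of size one, so they pass through unchanged.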