This study introduces MeDM, an efficient and effective method that uses pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. With this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also propose a workaround for modifying observed-space scores in latent-space Diffusion Models. Notably, MeDM requires neither fine-tuning nor test-time optimization of the Diffusion Models. Through extensive qualitative, quantitative, and subjective experiments on various benchmarks, we demonstrate the effectiveness and superiority of the proposed approach. The project page is available at https://medm2023.github.io
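To make the idea of a flow-based coding with a closed-form solution more concrete, the following is a minimal sketch and not MeDM's actual implementation. It assumes that optical flow has already been converted into integer correspondence ids, so that pixels depicting the same scene point in different frames share an id. The function name `harmonize` and the `group_ids` encoding are hypothetical, introduced only for illustration. Under this assumption, the least-squares choice of a single shared value per id is simply the mean of the independent frame-wise predictions for that id.

```python
from collections import defaultdict


def harmonize(frames, group_ids):
    """Replace each pixel by the mean of all pixels sharing its group id.

    frames:    list of frames, each a flat list of float pixel values
               (independent frame-wise predictions)
    group_ids: same shape as frames; pixels that optical flow maps to the
               same scene point carry the same integer id (a hypothetical
               encoding assumed for this sketch)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for frame, ids in zip(frames, group_ids):
        for value, gid in zip(frame, ids):
            sums[gid] += value
            counts[gid] += 1
    # Minimizing sum_i (x - v_i)^2 over x gives x = mean(v_i): the closed form.
    means = {gid: sums[gid] / counts[gid] for gid in sums}
    # Broadcast each shared value back to every pixel that references it.
    return [[means[gid] for gid in ids] for ids in group_ids]


# Two 3-pixel frames; the scene shifts right by one pixel between them,
# so ids 0 and 1 appear in both frames, id 2 leaves and id 3 enters.
frames = [[1.0, 4.0, 6.0], [9.0, 3.0, 8.0]]
group_ids = [[0, 1, 2], [3, 0, 1]]
result = harmonize(frames, group_ids)
```

In this toy example, pixels visible in both frames end up with identical values, while pixels seen in only one frame keep their original prediction. A real pipeline would additionally need occlusion handling and sub-pixel flow, which this sketch leaves out.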