Text-driven image and video diffusion models have recently achieved unprecedented generation realism. While diffusion models have been successfully applied to image editing, very few works have done so for video editing. We present the first diffusion-based method able to perform text-based motion and appearance editing of general videos. Our approach uses a video diffusion model to combine, at inference time, the low-resolution spatio-temporal information from the original video with new, high-resolution information that it synthesizes to align with the guiding text prompt. Because high fidelity to the original video requires retaining some of its high-resolution information, we add a preliminary stage of finetuning the model on the original video, which significantly boosts fidelity. We propose to improve motion editability with a new mixed objective that jointly finetunes with full temporal attention and with temporal attention masking. We further introduce a new framework for image animation: we first transform the image into a coarse video using simple image processing operations such as replication and perspective geometric projections, and then use our general video editor to animate it. As a further application, our method can be used for subject-driven video generation. Extensive qualitative and numerical experiments showcase the editing ability of our method and establish its superior performance compared to baseline methods.
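To make the image-animation step more concrete, the following is a minimal sketch of how a single image might be turned into a coarse video by replication and a perspective projection. It is an illustration under stated assumptions, not the authors' implementation: the function names `warp_homography` and `image_to_coarse_video`, the `max_tilt` parameter, the particular homography, and the nearest-neighbor sampling are all hypothetical choices made here for brevity.

```python
import numpy as np


def warp_homography(image, H):
    """Warp an image by a 3x3 homography using nearest-neighbor inverse mapping."""
    height, width = image.shape[:2]
    ys, xs = np.mgrid[0:height, 0:width]
    # Homogeneous coordinates of every output pixel, shape (3, height * width).
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(height * width)])
    # Inverse mapping: locate the source position of each output pixel.
    src = np.linalg.inv(H) @ dst
    src_x = np.rint(src[0] / src[2]).astype(int)
    src_y = np.rint(src[1] / src[2]).astype(int)
    # Clamp to the image border so out-of-range samples repeat edge pixels.
    src_x = np.clip(src_x, 0, width - 1)
    src_y = np.clip(src_y, 0, height - 1)
    return image[src_y, src_x].reshape(image.shape)


def image_to_coarse_video(image, num_frames=8, max_tilt=0.2):
    """Build a coarse video from one image (hypothetical helper).

    Frame 0 is an exact replica of the input; later frames apply an
    increasingly strong perspective tilt. `max_tilt` is an assumed knob,
    scaled by the image width so the projective divisor stays positive.
    """
    width = image.shape[1]
    frames = []
    for t in np.linspace(0.0, 1.0, num_frames):
        H = np.array(
            [
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [t * max_tilt / width, 0.0, 1.0],  # perspective term
            ]
        )
        frames.append(warp_homography(image, H))
    return np.stack(frames)  # shape: (num_frames, height, width, ...)
```

A coarse video built this way would then be passed to the video editor together with a text prompt. The warp here only supplies rough, low-frequency motion cues, and the editor is expected to synthesize the fine detail.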