Image diffusion models, trained on massive image collections, have emerged as the most versatile image generators in terms of quality and diversity. They support inverting real images and conditional (e.g., text-guided) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth-conditioned) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to subsequent frames via self-attention feature injection, which adapts the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code of the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate its effectiveness through extensive experiments and compare it against four prior and concurrent efforts (on arXiv). We show that realistic text-guided video edits are possible without any compute-intensive preprocessing or video-specific finetuning.
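To make the idea of self-attention feature injection concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that the queries come from the frame currently being denoised while the keys and values are taken from the edited anchor frame and the previously edited frame; the abstract does not specify this choice, so it is an assumption. The function names (`attention`, `injected_self_attention`) are hypothetical, and the learned query/key/value projections of a real diffusion model are replaced by identity maps for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_out = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim_out)])
    return out


def injected_self_attention(current, anchor, previous):
    """Attend from the current frame's tokens to injected source tokens.

    Assumption (not stated in the abstract): keys and values are the
    concatenated features of the anchor frame and the previous edited frame.
    Projections are identity maps here, unlike in a trained model.
    """
    source = anchor + previous  # concatenate token lists
    return attention(current, source, source)


# Toy features: two tokens per source frame, one token in the current frame.
anchor = [[1.0, 0.0], [0.0, 1.0]]
previous = [[0.9, 0.1], [0.1, 0.9]]
current = [[1.0, 0.2]]
out = injected_self_attention(current, anchor, previous)
```

Each output token is a convex combination of anchor and previous-frame features, which is why the appearance of already edited frames can carry over to the current one. The latent-code adjustment mentioned above is not covered by this sketch.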