Pix2Video: Video Editing using Image Diffusion

Image diffusion models, trained on massive image collections, have emerged as the most versatile image generator model in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to the future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach by extensive experimentation and compare it against four different prior and parallel efforts (on ArXiv). We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.

翻译：图像扩散模型在大规模图像集合上训练而成，已成为质量和多样性方面最通用的图像生成模型。它支持真实图像的反演和条件式（如文本）生成，因而在高质量图像编辑应用中极具吸引力。我们研究如何利用此类预训练图像模型进行文本引导的视频编辑。关键挑战在于实现目标编辑的同时，保持源视频的内容不变。我们的方法包含两个简单步骤：首先，使用预训练的结构引导（如深度）图像扩散模型对锚定帧进行文本引导编辑；随后，在关键步骤中，通过自注意力特征注入逐步将改动传播至后续帧，以适配扩散模型的核心去噪步骤。接着，通过调整当前帧的潜在编码来巩固改动，再继续后续处理。我们的方法无需训练，且能泛化至多种编辑任务。我们通过大量实验证明了该方法的有效性，并与四项此前及同期研究（见ArXiv）进行了对比。结果表明，无需计算密集型预处理或视频专用微调，即可实现逼真的文本引导视频编辑。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】视觉提示调整（VPT），Vision Prompt Tuning

专知会员服务

32+阅读 · 2022年3月12日

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

专知会员服务

27+阅读 · 2022年3月3日

【CVPR 2022】可控图像合成与编辑的合成生成先验学习，SemanticStyleGAN: Learning Compositonal Generative Priors for Controllable Image Synthesis and Editing

专知会员服务

23+阅读 · 2022年3月3日

【CVPR 2022】使用多模态Transformer的端到端视频对象分割，End-to-End Referring Video Object Segmentation with Multimodal Transformer

专知会员服务

28+阅读 · 2022年3月3日