InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing

Large text-to-image diffusion models have achieved remarkable success in generating diverse, high-quality images. These models have also been used to edit input images simply by changing the text prompt. When they are applied to videos, however, the main challenge is ensuring temporal consistency and coherence across frames. In this paper, we propose InFusion, a framework for zero-shot text-based video editing that leverages large pre-trained image diffusion models. Our framework supports editing of multiple concepts, with pixel-level control over the diverse concepts mentioned in the editing prompt. Specifically, we inject the difference between the features obtained with the source and edit prompts from the U-Net residual blocks of the decoder layers. Combined with injected attention features, this makes it feasible to query the source contents and scale the edited concepts while injecting the unedited parts. The editing is further controlled in a fine-grained manner with mask extraction and attention fusion, which cut the edited part from the source and paste it into the denoising pipeline for the editing prompt. Because it requires no training, our framework is a low-cost alternative to one-shot tuned models for editing. We demonstrate complex concept editing with a generalised image model (Stable Diffusion v1.5) using LoRA. The adaptation is compatible with all existing image diffusion techniques. Extensive experimental results demonstrate the effectiveness of our framework over existing methods in rendering high-quality and temporally consistent videos.
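The injection and fusion steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: plain NumPy arrays stand in for U-Net decoder features and cross-attention maps, and the function names, the `scale` and `threshold` parameters, and the array shapes are all hypothetical choices made for illustration. The sketch assumes that the mask is obtained by thresholding a normalized attention map and that fusion is a per-pixel blend of edit-branch and source-branch features.

```python
import numpy as np


def inject_feature_difference(edit_feat, source_feat, scale=1.0):
    """Add the scaled (edit - source) feature difference onto the source features.

    scale is a hypothetical knob: 0 keeps the source, 1 yields the edit features.
    """
    return source_feat + scale * (edit_feat - source_feat)


def mask_from_attention(attn_map, threshold=0.5):
    """Binarize an (H, W) attention map after min-max normalization.

    Assumption: the edited concept's region is where attention is high.
    """
    lo, hi = attn_map.min(), attn_map.max()
    norm = (attn_map - lo) / (hi - lo + 1e-8)
    return (norm > threshold).astype(attn_map.dtype)


def fuse_with_mask(edit_feat, source_feat, mask):
    """Keep edit-branch features inside the mask, source features outside.

    Features have shape (C, H, W); the (H, W) mask broadcasts over channels.
    """
    return mask * edit_feat + (1.0 - mask) * source_feat


# Toy example: 2 channels on a 2x2 spatial grid.
source = np.zeros((2, 2, 2))
edit = np.ones((2, 2, 2))
attn = np.array([[0.0, 1.0], [0.2, 0.9]])

mask = mask_from_attention(attn)
fused = fuse_with_mask(edit, source, mask)
injected = inject_feature_difference(edit, source, scale=0.5)
```

In this toy setup the right-hand column of the grid receives the edit features and the left-hand column keeps the source features, which mirrors the idea of confining the edit to the masked region while the unedited parts are carried over from the source.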

