Large text-to-image diffusion models have achieved remarkable success in generating diverse, high-quality images. These models have also been used to edit input images simply by changing the text prompt. When they are applied to videos, however, the main challenge is ensuring temporal consistency and coherence across frames. In this paper, we propose InFusion, a framework for zero-shot text-based video editing that leverages large pre-trained image diffusion models. Our framework supports editing of multiple concepts, with pixel-level control over the diverse concepts mentioned in the editing prompt. Specifically, we inject the difference between the features obtained with the source and edit prompts, taken from the U-Net residual blocks of the decoder layers. Combined with injected attention features, this makes it feasible to query the source contents and scale the edited concepts while injecting the unedited parts. The editing is further controlled in a fine-grained manner through mask extraction and attention fusion, which cut the edited part from the source and paste it into the denoising pipeline for the editing prompt. Because it requires no training, our framework is a low-cost alternative to one-shot tuned models for editing. We demonstrate complex concept editing with a generalised image model (Stable Diffusion v1.5) using LoRA. The adaptation is compatible with all existing image diffusion techniques. Extensive experimental results demonstrate the framework's effectiveness over existing methods in rendering high-quality and temporally consistent videos.
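The two operations named in the abstract, injecting a scaled feature difference and fusing edited and source content through a mask, can be illustrated with a toy sketch. This is not the InFusion implementation: the function names `inject_difference` and `fuse_with_mask`, the `scale` parameter, the linear blending rule, and the use of flat Python lists in place of U-Net feature tensors are all assumptions made only for illustration.

```python
from typing import List

# Hypothetical stand-in for a flattened feature map: one value per spatial position.
Feature = List[float]


def inject_difference(source: Feature, edited: Feature, scale: float) -> Feature:
    """Add the scaled (edited - source) difference onto the source features.

    Assumption: a simple linear rule; the abstract does not give the exact formula.
    """
    if len(source) != len(edited):
        raise ValueError("source and edited features must have the same length")
    return [s + scale * (e - s) for s, e in zip(source, edited)]


def fuse_with_mask(source: Feature, edited: Feature, mask: List[float]) -> Feature:
    """Keep edited values where the mask is 1 and source values where it is 0."""
    if not (len(source) == len(edited) == len(mask)):
        raise ValueError("source, edited and mask must have the same length")
    return [m * e + (1.0 - m) * s for s, e, m in zip(source, edited, mask)]


# Toy values only; real features would come from the diffusion model.
source = [1.0, 2.0, 3.0, 4.0]
edited = [3.0, 2.0, 7.0, 0.0]
mask = [1.0, 0.0, 1.0, 0.0]  # 1 marks positions belonging to the edited concept

scaled = inject_difference(source, edited, scale=0.5)
fused = fuse_with_mask(source, scaled, mask)
print(scaled)
print(fused)
```

In this sketch the mask confines the edit to the marked positions, so unmasked positions keep their source values regardless of how strongly the difference is scaled.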