Large text-to-image diffusion models have achieved remarkable success in generating diverse, high-quality images that align with the text prompt used to edit an input image. When these models are applied to video, however, the main challenge is ensuring temporal consistency and coherence across frames. In this paper, we propose InFusion, a framework for zero-shot text-based video editing that leverages large pre-trained image diffusion models. Our framework supports editing of multiple concepts, with pixel-level control over the diverse concepts mentioned in the editing prompt. Specifically, we inject the difference between the features obtained with the source and edit prompts from the U-Net residual blocks in the decoder layers. Combined with injected attention features, this makes it feasible to query the source contents and scale the edited concepts while injecting the unedited parts. The editing is further controlled in a fine-grained manner by a mask extraction and attention fusion strategy, which cuts the edited part from the source and pastes it into the denoising pipeline for the editing prompt. Because it requires no training, our framework is a low-cost alternative to one-shot tuned models for editing. We demonstrate complex concept editing with a generalised image model (Stable Diffusion v1.5) using LoRA. The adaptation is compatible with all existing image diffusion techniques. Extensive experimental results demonstrate its effectiveness over existing methods in rendering high-quality and temporally consistent videos.
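The two feature-level operations named in the abstract can be illustrated with a minimal sketch. This is not the InFusion implementation: the function names, the linear scaling rule, the binary mask, and the direction of the mask blend are all assumptions made for illustration, and feature maps are reduced to flat lists of floats so the example runs without any deep-learning library.

```python
from typing import List

# Assumption: a feature map is flattened to one float per spatial location.
Vector = List[float]


def inject_feature_difference(src: Vector, edit: Vector, scale: float) -> Vector:
    """Shift source features along the (edit - source) direction.

    Hypothetical rule: scale = 0 reproduces the source features, scale = 1
    reproduces the edit features, and larger values exaggerate the edit.
    """
    return [s + scale * (e - s) for s, e in zip(src, edit)]


def fuse_with_mask(src: Vector, edit: Vector, mask: Vector) -> Vector:
    """Keep edit features where mask is 1 and source features where it is 0."""
    return [m * e + (1.0 - m) * s for s, e, m in zip(src, edit, mask)]


# Toy features for four spatial locations (illustrative values only).
src_feat = [1.0, 2.0, 3.0, 4.0]
edit_feat = [1.0, 4.0, 3.0, 8.0]
mask = [0.0, 1.0, 0.0, 0.0]  # only the second location counts as edited

scaled = inject_feature_difference(src_feat, edit_feat, scale=1.5)
fused = fuse_with_mask(src_feat, scaled, mask)
print(scaled)
print(fused)
```

In this sketch the scaled difference changes every location where the two prompts disagree, while the mask restricts the final change to the region marked as edited, which is one way to read the fine-grained control the abstract describes.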