Diffusion-based generative models have achieved remarkable success in text-based image generation. However, because the generation process involves substantial randomness, applying such models to real-world visual content editing remains challenging, especially for videos. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor use-specific masks. To edit videos consistently, we propose several techniques built on pre-trained models. First, in contrast to straightforward DDIM inversion, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are fused directly into the editing process rather than being generated during denoising. To further minimize semantic leakage from the source video, we then fuse self-attention maps using a blending mask obtained from the cross-attention features of the source prompt. Furthermore, we reform the self-attention mechanism in the denoising UNet by introducing spatial-temporal attention to ensure frame consistency. Although succinct, our method is the first to demonstrate zero-shot text-driven video style and local attribute editing using a pre-trained text-to-image model. Built on a text-to-video model, it also achieves better zero-shot shape-aware editing. Extensive experiments demonstrate that our method is superior to previous works in temporal consistency and editing capability.
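The mask-based fusion of self-attention maps described in the abstract can be illustrated with a minimal sketch. This is one plausible reading of the idea, not the authors' implementation: the function names, the threshold value, the toy numbers, and the per-position hard switch between maps are all assumptions made for illustration. Here the cross-attention of the edited word is binarized into a mask; inside the mask the self-attention from the editing pass is used, and outside it the self-attention stored during inversion is kept.

```python
from typing import List

Matrix = List[List[float]]


def blend_mask(cross_attn: List[float], threshold: float = 0.3) -> List[int]:
    """Binarize a per-position cross-attention map for the edited word.

    `threshold` is an illustrative value, not one taken from the paper.
    """
    return [1 if a > threshold else 0 for a in cross_attn]


def fuse_self_attention(src: Matrix, edit: Matrix, mask: List[int]) -> Matrix:
    """For each query position, take the editing-pass attention row inside
    the mask and the stored inversion-pass row outside it."""
    if not (len(src) == len(edit) == len(mask)):
        raise ValueError("src, edit and mask must cover the same positions")
    return [list(edit[i]) if mask[i] else list(src[i]) for i in range(len(mask))]


# Toy example with 3 spatial positions (hypothetical numbers).
src_attn = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
edit_attn = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
cross_attn_for_edited_word = [0.05, 0.70, 0.10]

mask = blend_mask(cross_attn_for_edited_word)
fused = fuse_self_attention(src_attn, edit_attn, mask)
```

In this toy case only the middle position attends strongly to the edited word, so only its attention row comes from the editing pass; the other rows keep the source structure, which is how the mask limits leakage between the edited and unedited regions.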