Diffusion-based generative models have achieved remarkable success in text-based image generation. However, because the generation process involves substantial randomness, applying such models to real-world visual content editing, especially in videos, remains challenging. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor a user-specific mask. To edit videos consistently, we propose several techniques built on pre-trained models. First, in contrast to straightforward DDIM inversion, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are fused directly into the editing process rather than being regenerated during denoising. To further minimize semantic leakage from the source video, we then fuse self-attention maps using a blending mask obtained from the cross-attention features of the source prompt. Furthermore, we reform the self-attention mechanism in the denoising UNet by introducing spatial-temporal attention to ensure frame consistency. Although succinct, our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model. It also achieves better zero-shot shape-aware editing when built on a text-to-video model. Extensive experiments demonstrate better temporal consistency and editing capability than previous works.
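The mask-based fusion of self-attention described above can be illustrated with a minimal NumPy sketch. It is one plausible reading of the abstract, not the authors' implementation: the function name `blend_self_attention`, the tensor layouts, the per-frame normalization, and the `threshold` value are all assumptions made for illustration. The sketch thresholds the source prompt's cross-attention for one token into a binary mask, then uses the edited self-attention inside the mask and the self-attention stored during inversion everywhere else.

```python
import numpy as np


def blend_self_attention(src_self_attn, edit_self_attn, src_cross_attn,
                         token_index, threshold=0.3):
    """Fuse self-attention maps with a mask derived from cross-attention.

    Assumed (hypothetical) layouts:
      src_self_attn, edit_self_attn: (frames, N, N), N spatial locations
      src_cross_attn:                (frames, N, tokens)
    """
    # Attention of every spatial location to the chosen prompt token: (frames, N)
    token_attn = src_cross_attn[..., token_index]

    # Normalize per frame so the threshold is relative to the peak response
    # (an assumption; any comparable normalization would do).
    peak = token_attn.max(axis=-1, keepdims=True)
    token_attn = token_attn / np.maximum(peak, 1e-8)

    # Binary blending mask over query locations, broadcast across keys.
    mask = (token_attn > threshold).astype(src_self_attn.dtype)[..., None]

    # Inside the mask: attention from the editing pass.
    # Outside the mask: attention stored during inversion of the source video.
    return mask * edit_self_attn + (1.0 - mask) * src_self_attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, n, tokens = 2, 16, 5
    src = rng.random((frames, n, n))
    edit = rng.random((frames, n, n))
    cross = rng.random((frames, n, tokens))
    fused = blend_self_attention(src, edit, cross, token_index=2)
    print(fused.shape)
```

Masking by query location keeps the source structure and motion outside the edited region, which is the stated purpose of limiting semantic leakage; a real pipeline would apply this per layer and per denoising step inside the UNet.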