Diffusion-based generative models have achieved remarkable success in text-based image generation. However, because the generation process involves substantial randomness, it remains challenging to apply such models to real-world visual content editing, especially for videos. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor user-specific masks. To edit videos consistently, we propose several techniques built on pre-trained models. First, in contrast to straightforward DDIM inversion, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are fused directly into the editing process rather than being generated during denoising. Second, to further minimize semantic leakage from the source video, we fuse self-attentions using a blending mask obtained from the cross-attention features of the source prompt. Third, we reform the self-attention mechanism in the denoising UNet by introducing spatial-temporal attention to ensure frame consistency. Although succinct, our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model. It also achieves better zero-shot shape-aware editing when built on a text-to-video model. Extensive experiments demonstrate superior temporal consistency and editing capability compared with previous works.
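The mask-based fusion of self-attentions can be illustrated with a minimal sketch. It is not the FateZero implementation: the function names, the hard threshold, the one-mask-value-per-query-position layout, and the choice to keep the stored source attention outside the mask are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def binary_mask(cross_attn: List[float], threshold: float) -> List[float]:
    """Threshold a cross-attention map (one value per spatial position)
    into a 0/1 blending mask. The threshold is a hypothetical parameter."""
    return [1.0 if value > threshold else 0.0 for value in cross_attn]


def blend_self_attention(
    src_attn: Matrix, edit_attn: Matrix, mask: List[float]
) -> Matrix:
    """Fuse two self-attention maps row by row (assumed layout: one row
    per query position). Where the mask is 1 the row comes from the
    editing pass; elsewhere the row stored during inversion is reused."""
    return [
        [m * e + (1.0 - m) * s for s, e in zip(src_row, edit_row)]
        for src_row, edit_row, m in zip(src_attn, edit_attn, mask)
    ]


if __name__ == "__main__":
    # Toy 2-position example: only position 0 responds to the edited word.
    src = [[0.9, 0.1], [0.2, 0.8]]    # stored during inversion
    edit = [[0.5, 0.5], [0.4, 0.6]]   # produced while denoising with the new prompt
    mask = binary_mask([0.7, 0.1], threshold=0.3)
    print(blend_self_attention(src, edit, mask))
```

In this toy case the first row is taken from the editing pass and the second row is copied from the source, which is the sense in which the mask limits where the edit can change the attention.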