Although large-scale text-to-image generative models show promising performance in synthesizing high-quality images, applying them directly to image editing remains a significant challenge. The challenge grows in video editing because of the added time dimension. It is especially acute for real-world videos, where the structural layout must stay stable across frames while localized edits are made without disrupting existing content. In this paper, we propose RealCraft, an attention-control-based method for zero-shot editing of real-world videos. By swapping cross-attention for new feature injection and relaxing the spatial-temporal attention of the edited object, we achieve localized shape-wise edits along with enhanced temporal consistency. Our model uses Stable Diffusion directly and requires no additional information. We showcase the proposed zero-shot attention-control-based method across a range of videos, demonstrating shape-wise, time-consistent, and parameter-free editing in videos of up to 64 frames.
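To make the idea of cross-attention swapping more concrete, the following is a minimal NumPy sketch in the spirit of prompt-to-prompt-style attention control. It is not RealCraft's implementation: the function names (`cross_attention_probs`, `swap_cross_attention`), the tensor shapes, and the renormalization step are all assumptions made for illustration. The sketch keeps the source prompt's attention maps for unchanged tokens, which preserves layout, and takes the edit prompt's maps for the edited tokens, which injects the new feature.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_probs(queries, keys):
    """Hypothetical helper: queries (pixels, d), keys (tokens, d) -> (pixels, tokens)."""
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d), axis=-1)

def swap_cross_attention(source_probs, edit_probs, edited_tokens):
    """Illustrative swap (an assumption, not the paper's exact rule).

    Columns listed in edited_tokens come from the edit prompt's attention;
    all other columns keep the source prompt's attention.
    """
    idx = np.asarray(edited_tokens, dtype=int)
    mixed = source_probs.copy()
    mixed[:, idx] = edit_probs[:, idx]
    # Renormalize so each pixel's attention over tokens sums to 1 again.
    return mixed / mixed.sum(axis=-1, keepdims=True)

# Toy example: 16 spatial positions, 5 prompt tokens, feature size 8.
rng = np.random.default_rng(0)
src = cross_attention_probs(rng.normal(size=(16, 8)), rng.normal(size=(5, 8)))
edit = cross_attention_probs(rng.normal(size=(16, 8)), rng.normal(size=(5, 8)))
mixed = swap_cross_attention(src, edit, edited_tokens=[2])
```

In a real diffusion pipeline the swap would be applied inside the U-Net's cross-attention layers at selected denoising steps; which layers and steps to use is a design choice that this sketch leaves open.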