Although large-scale text-to-image generative models have shown promising performance in synthesizing high-quality images, directly applying these models to image editing remains a significant challenge. This challenge is further amplified in video editing by the additional dimension of time, especially for real videos, where editing requires maintaining a stable semantic layout across frames while executing localized edits precisely without disrupting the existing background. In this paper, we propose RealCraft, an attention-control-based method for zero-shot editing of real videos. By applying object-centric manipulation to the cross-attention between prompts and frames and to the spatial-temporal attention within the frames, we achieve precise shape-wise editing along with enhanced consistency. Our model can be used directly with Stable Diffusion and requires no additional localized information. We showcase our zero-shot attention-control-based method across a range of videos, demonstrating localized, high-fidelity, shape-precise, and time-consistent editing in videos of various lengths, up to 64 frames.