Editing videos with textual guidance has gained popularity due to its streamlined process, which requires users only to edit the text prompt corresponding to the source video. Recent studies have explored and exploited large-scale text-to-image diffusion models for text-guided video editing, resulting in remarkable video editing capabilities. However, these methods may still suffer from limitations such as mislocated objects and an incorrect number of objects. Consequently, the controllability of video editing remains a formidable challenge. In this paper, we aim to address the above limitations by proposing a Re-Attentional Controllable Video Diffusion Editing (ReAtCo) method. Specifically, to align the spatial placement of the target objects with the edited text prompt in a training-free manner, we propose Re-Attentional Diffusion (RAD), which refocuses the cross-attention activation responses between the edited text prompt and the target video during the denoising stage, yielding a spatially location-aligned and semantically high-fidelity manipulated video. In particular, to faithfully preserve the invariant region content with fewer border artifacts, we propose an Invariant Region-guided Joint Sampling (IRJS) strategy that mitigates the intrinsic sampling errors w.r.t. the invariant regions at each denoising timestep and constrains the generated content to harmonize with the invariant region content. Experimental results verify that ReAtCo consistently improves the controllability of video diffusion editing and achieves superior video editing performance.
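To make the re-attentional idea concrete, the sketch below shows one plausible way to refocus cross-attention during denoising: the attention responses of the edited object's text tokens are amplified inside a user-specified region mask and damped outside it, then renormalized over tokens. This is a minimal PyTorch illustration under our own assumptions, not the paper's implementation; the function name `rad_cross_attention`, the `boost` parameter, and the mask-based reweighting scheme are hypothetical.

```python
import torch

def rad_cross_attention(q, k, v, target_token_ids, region_mask, boost=2.0):
    """Refocus cross-attention toward a desired spatial region (illustrative).

    q:                 (B, N_pix, d)  queries from the video latents
    k, v:              (B, N_tok, d)  keys/values from the edited text prompt
    target_token_ids:  indices of the edited object's tokens in the prompt
    region_mask:       (B, N_pix) float mask, 1 inside the desired placement
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)  # (B, N_pix, N_tok)

    # Amplify target-token responses inside the region, damp them outside.
    weight = torch.ones_like(attn)
    inside = region_mask.unsqueeze(-1)  # (B, N_pix, 1), broadcasts over tokens
    weight[..., target_token_ids] = boost * inside + (1.0 / boost) * (1.0 - inside)

    attn = attn * weight
    attn = attn / attn.sum(dim=-1, keepdim=True)  # renormalize over tokens
    return attn @ v  # (B, N_pix, d)
```

Because the reweighting only rescales already-computed attention maps, such a scheme is training-free and can be dropped into the denoising loop of a pretrained diffusion backbone.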
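Similarly, a RePaint-style reading of the invariant-region-guided joint sampling is sketched below: at every denoising timestep, the invariant region is overwritten with the source latents re-noised to the matching noise level, so the edited region is sampled jointly with, and stays consistent with, the preserved content. It assumes a diffusers-style UNet/scheduler interface; the name `irjs_step`, the explicit `t_prev` argument, and the hard mask composite are illustrative assumptions rather than the paper's exact error-mitigation scheme.

```python
import torch

@torch.no_grad()
def irjs_step(unet, scheduler, x_t, t, t_prev, text_emb, src_latents, invariant_mask):
    """One denoising step with invariant-region-guided compositing (illustrative).

    x_t:            current noisy latents of the edited video
    src_latents:    clean latents of the source video
    invariant_mask: 1 where content must be preserved, 0 in edited regions
    """
    # Edited branch: predict noise and take a standard scheduler step.
    eps = unet(x_t, t, encoder_hidden_states=text_emb).sample
    x_prev_edit = scheduler.step(eps, t, x_t).prev_sample

    # Invariant branch: re-noise the clean source latents to timestep t_prev
    # so both branches share a consistent noise level before compositing.
    noise = torch.randn_like(src_latents)
    x_prev_src = scheduler.add_noise(src_latents, noise, t_prev)

    # Joint sample: keep source content in the invariant region and generated
    # content elsewhere; repeating this at every timestep encourages the two
    # to harmonize at the region border.
    return invariant_mask * x_prev_src + (1.0 - invariant_mask) * x_prev_edit
```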