Text-to-video editing aims to edit the visual appearance of a source video conditioned on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although spatio-temporal attention adds temporal context, it may introduce irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency of the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing method to improve its visual consistency. Experimental results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance. In particular, our method excels at maintaining visual consistency in the edited videos.
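To make the idea of attention along flow paths concrete, the following is a minimal sketch and not the authors' implementation. It assumes that trajectories have already been extracted from optical flow and are given as one patch index per frame, that trajectories do not share patches, and that queries, keys, and values are the raw patch features with no learned projections. The names `flow_guided_attention`, `tokens`, and `trajectories` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flow_guided_attention(tokens, trajectories):
    """Attend only among patches that lie on the same flow trajectory.

    tokens[t][n] is the feature vector (list of floats) of patch n in frame t.
    trajectories is a list of paths; path[t] is the patch index that the
    path visits in frame t (assumed precomputed from optical flow).
    Patches that lie on no trajectory are returned unchanged.
    """
    dim = len(tokens[0][0])
    out = [[list(vec) for vec in frame] for frame in tokens]  # deep copy
    for path in trajectories:
        # Gather the features of every patch on this trajectory.
        path_tokens = [tokens[t][n] for t, n in enumerate(path)]
        for t, n in enumerate(path):
            query = path_tokens[t]
            # Scaled dot-product scores against patches on the same path only.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in path_tokens
            ]
            weights = softmax(scores)
            # Weighted sum of the values (here: the same raw features).
            out[t][n] = [
                sum(w * v[d] for w, v in zip(weights, path_tokens))
                for d in range(dim)
            ]
    return out


# Toy input: 3 frames, 2 patches per frame, 2-dimensional features.
tokens = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
# One trajectory: patch 0 in frame 0, patch 1 in frame 1, patch 0 in frame 2.
trajectories = [[0, 1, 0]]
edited = flow_guided_attention(tokens, trajectories)
```

In this toy input, the three patches on the trajectory carry the same feature, so attention along the path leaves them essentially unchanged, while patches off the path are never mixed in. Full spatio-temporal attention would instead let every patch attend to every patch in every frame, which is the source of the irrelevant information discussed above.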