Large-scale text-to-image (T2I) diffusion models have been extended to text-guided video editing, yielding impressive zero-shot editing performance. Nonetheless, the generated videos often exhibit spatial irregularities and temporal inconsistencies because the temporal characteristics of videos are not faithfully modeled. In this paper, we propose Temporal-Consistent Video Editing (TCVE), an elegant yet effective method that mitigates temporal inconsistency for robust text-guided video editing. In addition to a pretrained T2I 2D Unet for spatial content manipulation, we establish a dedicated temporal Unet to faithfully capture the temporal coherence of the input video sequence. Furthermore, to connect the spatial-focused and temporal-focused components, we formulate a cohesive spatial-temporal modeling unit. This unit links the temporal Unet with the pretrained 2D Unet, enhancing the temporal consistency of the generated videos while preserving the capacity for video content manipulation. Quantitative results and qualitative visualizations demonstrate that TCVE achieves state-of-the-art performance in both temporal consistency and editing capability, surpassing existing methods.
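The two-branch idea described in the abstract can be illustrated with a toy NumPy sketch. This is not the TCVE architecture: the abstract does not specify the networks or the design of the spatial-temporal modeling unit, so every function here (`spatial_branch`, `temporal_branch`, `fuse`, `temporal_roughness`) is a hypothetical stand-in. A per-frame box blur plays the role of the 2D spatial path, a 1D convolution along the frame axis plays the role of the temporal path, and a weighted sum plays the role of the unit that connects them.

```python
import numpy as np


def spatial_branch(video):
    """Stand-in for the per-frame 2D path: a 3x3 box blur applied to each frame independently."""
    t, h, w = video.shape
    padded = np.pad(video, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros_like(video, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0


def temporal_branch(video, kernel=(0.25, 0.5, 0.25)):
    """Stand-in for the temporal path: a 1D convolution along the frame axis, per pixel."""
    t = video.shape[0]
    k = np.asarray(kernel, dtype=float)
    r = len(k) // 2
    padded = np.pad(video, ((r, r), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(video, dtype=float)
    for i, weight in enumerate(k):
        out += weight * padded[i:i + t]
    return out


def fuse(spatial, temporal, alpha=0.5):
    """Assumed fusion rule (a plain weighted sum); the real unit is not described in the abstract."""
    return alpha * spatial + (1.0 - alpha) * temporal


def temporal_roughness(video):
    """Illustrative proxy for temporal inconsistency: mean absolute difference between adjacent frames."""
    return float(np.mean(np.abs(np.diff(video, axis=0))))


# Toy input: 8 frames of 16x16 noise, shaped (frames, height, width).
video = np.random.default_rng(0).random((8, 16, 16))
edited = fuse(spatial_branch(video), temporal_branch(video))
```

Comparing `temporal_roughness(video)` with `temporal_roughness(temporal_branch(video))` shows the effect the temporal path is meant to have in this toy setting: smoothing along the frame axis cannot increase the frame-to-frame differences.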