Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, how to extend this success to video editing remains unclear. Recent initial attempts at video editing require significant text-to-video data and computational resources for training, which are often not accessible. In this work, we propose vid2vid-zero, a simple yet effective method for zero-shot video editing. vid2vid-zero leverages off-the-shelf image diffusion models and does not require training on any video. At the core of our method are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos. Code will be made available at \url{https://github.com/baaivision/vid2vid-zero}.
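To illustrate how attention can provide bi-directional temporal modeling without retraining, the following is a minimal NumPy sketch of one possible form of cross-frame attention. It is an assumption-laden illustration, not the vid2vid-zero implementation: the function name `cross_frame_attention`, the tensor shapes, and the choice to attend over all frames at once are hypothetical. The point it shows is that attention weights are computed from the inputs at run time, so the key/value set can be widened from one frame to many while the projection weights stay frozen.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(q, k, v):
    """Attend from every frame's queries to the keys/values of all frames.

    q, k, v: arrays of shape (frames, tokens, dim). They are assumed to come
    from the frozen projection layers of an image model's self-attention
    block (a hypothetical setup for this sketch).
    """
    frames, tokens, dim = q.shape
    # Flatten keys and values over the frame axis so that each query sees
    # tokens from earlier and later frames alike (bi-directional).
    k_all = k.reshape(frames * tokens, dim)
    v_all = v.reshape(frames * tokens, dim)
    scores = q @ k_all.T / np.sqrt(dim)  # (frames, tokens, frames * tokens)
    weights = softmax(scores, axis=-1)
    return weights @ v_all, weights      # output: (frames, tokens, dim)


# Toy input: 4 frames, 6 tokens per frame, 8 channels.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 6, 8)) for _ in range(3))
out, weights = cross_frame_attention(q, k, v)
```

With a single frame, this reduces to ordinary self-attention, which is why no new parameters are needed in this sketch. Attending to every frame costs memory that grows quadratically with the number of frames, so a practical variant might restrict the key/value set to a subset of frames.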