Diffusion-based generative models have recently achieved remarkable success in image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing that ensures strong temporal and spatial consistency. First, we propose combining atlas-based and pre-trained text-to-image diffusion models to provide a training-free and efficient editing method that fulfills temporal smoothness by design. Second, we leverage off-the-shelf panoptic segmenters along with edge detectors and adapt their use for conditioned diffusion-based atlas editing. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset in terms of semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only approximately one minute, and multiple compatible edits can be generated from a single text prompt. Project webpage: https://videdit.github.io