Motivated by the strong performance of image diffusion models, a growing number of researchers are working to extend these models to text-based video editing. However, current video editing methods face a dilemma between high fine-tuning cost and limited generation capacity. We conjecture that, compared with images, videos require additional constraints to preserve temporal consistency during editing. To this end, we propose EVE, a robust and efficient zero-shot video editing method. Guided by depth maps and temporal consistency constraints, EVE produces satisfactory video editing results at an affordable computational and time cost. Moreover, because no publicly available video editing dataset exists for fair comparison, we construct a new benchmark dataset, ZVE-50. Comprehensive experiments show that EVE achieves a satisfactory trade-off between performance and efficiency. We will release our dataset and codebase to facilitate future research.