Text-driven diffusion-based video editing presents a unique challenge not encountered in the image editing literature: establishing real-world motion. Unlike existing video editing approaches, here we focus on score distillation sampling to circumvent the standard reverse diffusion process and initiate optimization from videos that already exhibit natural motion. Our analysis reveals that while video score distillation can effectively introduce new content indicated by the target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match the space-time self-similarities of the original and edited videos during score distillation. Thanks to the use of score distillation, our approach is model-agnostic and can be applied to both cascaded and non-cascaded video diffusion frameworks. Through extensive comparisons with leading methods, our approach demonstrates its superiority in altering appearance while accurately preserving the original structure and motion.
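The two core ingredients described above, a score-distillation-style gradient and a space-time self-similarity matching loss, can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the paper's actual implementation: the function names (`self_similarity`, `ssim_matching_loss`, `sds_grad`) and the flattened token layout are assumptions made for clarity, and the diffusion model's noise prediction is represented by a placeholder tensor.

```python
import torch
import torch.nn.functional as F

def self_similarity(feats: torch.Tensor) -> torch.Tensor:
    """Cosine self-similarity over flattened space-time feature tokens.

    feats: (N, C) where N = T * H * W tokens from a video feature map
    (an assumed layout for this sketch).
    Returns an (N, N) similarity matrix.
    """
    f = F.normalize(feats, dim=-1)
    return f @ f.t()

def ssim_matching_loss(orig_feats: torch.Tensor,
                       edit_feats: torch.Tensor) -> torch.Tensor:
    """Match the space-time self-similarity maps of the original and
    edited videos, penalizing deviations in structure and motion."""
    return F.mse_loss(self_similarity(edit_feats),
                      self_similarity(orig_feats))

def sds_grad(eps_pred: torch.Tensor,
             eps: torch.Tensor,
             weight: float = 1.0) -> torch.Tensor:
    """Score distillation sampling gradient, in its standard form:
    w(t) * (eps_theta(x_t; y, t) - eps). Here eps_pred stands in for
    the video diffusion model's noise prediction."""
    return weight * (eps_pred - eps)
```

In a full pipeline, the edited video would be optimized with the SDS gradient to introduce the content indicated by the target text, while `ssim_matching_loss` (computed on intermediate diffusion features of both videos) regularizes the update to preserve the source video's structure and motion.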