In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still fall short of the quality and control that users desire. Previous works on video editing either extend image-based generative models in a zero-shot manner or require extensive fine-tuning, both of which can hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as the sole editing guidance, which introduces ambiguity and limits the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm that simplifies video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, and (2) utilizing an existing image-to-video generation model to produce the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tool to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which previous methods could not achieve. AnyV2V also supports videos of arbitrary length. Our evaluation shows that AnyV2V achieves CLIP scores comparable to those of baseline methods. Moreover, AnyV2V significantly outperforms these baselines in human evaluations, demonstrating notable improvements in visual consistency with the source video while producing high-quality edits across all editing tasks.
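To make the two-step pipeline concrete, the following is a minimal Python sketch of the control flow. The callables `edit_first_frame` and `i2v_generate` are hypothetical placeholders standing in for any off-the-shelf image editor and any image-to-video model with temporal feature injection; they illustrate the paradigm's structure under stated assumptions and are not part of a released AnyV2V API.

```python
from typing import Callable, List
import numpy as np

# A video frame, assumed to be an H x W x 3 uint8 array.
Frame = np.ndarray


def any_v2v(
    source_frames: List[Frame],
    edit_first_frame: Callable[[Frame], Frame],
    i2v_generate: Callable[[Frame, List[Frame]], List[Frame]],
) -> List[Frame]:
    """Sketch of the AnyV2V two-step pipeline.

    Step 1: edit the first frame with any off-the-shelf image
            editing model (prompt-based, reference-based, etc.).
    Step 2: regenerate the full clip with an image-to-video model
            conditioned on the edited frame, while temporal features
            from the source video are injected during sampling so the
            result stays consistent with the original motion.
    """
    # Step 1: any image editor can be plugged in here.
    edited_first = edit_first_frame(source_frames[0])

    # Step 2: the I2V model conditions on the edited first frame;
    # the source frames supply the features to inject (hypothetical
    # signature -- the injection mechanics live inside i2v_generate).
    return i2v_generate(edited_first, source_frames)
```

Because both stages are treated as black boxes behind these two callables, the sketch reflects the tuning-free claim: swapping in a different image editor or image-to-video backbone changes only the arguments, not the pipeline.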