In the dynamic field of digital content creation with generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extended image-based generative models in a zero-shot manner or required extensive fine-tuning, both of which hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as the sole editing guidance, which introduces ambiguity and limits the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm that simplifies video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, and (2) utilizing an existing image-to-video generation model to generate the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tool to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. AnyV2V also supports videos of arbitrary length. Our evaluation shows that AnyV2V outperforms baseline methods by a significant margin in both automatic and human evaluations, maintaining visual consistency with the source video while achieving high-quality edits across all editing tasks.
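Conceptually, the two-step paradigm can be summarized in the following sketch. This is a minimal illustration rather than the paper's implementation: the function name `anyv2v_edit`, the `I2VModel` interface, and its `invert`/`generate` methods are all hypothetical placeholders standing in for whichever off-the-shelf image editor and image-to-video model are plugged in.

```python
from typing import Any, Callable, List, Protocol


class I2VModel(Protocol):
    """Hypothetical interface for an off-the-shelf image-to-video model."""

    def invert(self, frames: List[Any]) -> Any:
        """Invert the source video to recover its latents/features."""
        ...

    def generate(self, first_frame: Any, injected_features: Any) -> List[Any]:
        """Regenerate the video from the edited first frame, injecting the
        source video's temporal features during denoising to preserve the
        original motion and layout."""
        ...


def anyv2v_edit(
    source_frames: List[Any],
    edit: Callable[[Any], Any],  # any off-the-shelf image editing model
    i2v_model: I2VModel,
) -> List[Any]:
    """Tuning-free video editing in two steps (hypothetical API)."""
    # Step 1: modify only the first frame with an image editing model
    # (prompt-based edit, style transfer, subject swap, identity edit, ...).
    edited_first_frame = edit(source_frames[0])

    # Step 2: invert the source video, then generate the edited video
    # conditioned on the edited first frame, injecting the source video's
    # features so motion and appearance consistency carry over.
    source_features = i2v_model.invert(source_frames)
    return i2v_model.generate(
        first_frame=edited_first_frame,
        injected_features=source_features,
    )
```

Because the two steps only require forward passes through existing models, any image editor and any image-to-video backbone can be swapped in without fine-tuning.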