We introduce a novel and efficient approach to text-based video-to-video editing that eliminates the need for resource-intensive per-video-per-model finetuning. At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix's image transfer via editing instructions, we adapt this paradigm to the video domain. By extending Prompt-to-Prompt to videos, we efficiently generate paired samples, each consisting of an input video and its edited counterpart. In addition, we introduce Long Video Sampling Correction, applied during sampling, to keep long videos consistent across batches. Our method surpasses current methods such as Tune-A-Video, marking substantial progress in text-based video-to-video editing and suggesting avenues for further exploration and deployment.