We introduce a novel and efficient approach to text-based video-to-video editing that eliminates the need for resource-intensive per-video-per-model finetuning. At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks. Inspired by InstructPix2Pix, which performs image transfer from an editing instruction, we adapt this paradigm to the video domain. By extending Prompt-to-Prompt to videos, we efficiently generate paired samples, each consisting of an input video and its edited counterpart. We also introduce Long Video Sampling Correction, applied during sampling, which keeps long videos consistent across batches. Our method surpasses current methods such as Tune-A-Video, marking substantial progress in text-based video-to-video editing and suggesting avenues for further exploration and deployment.