We introduce an efficient approach to text-based video-to-video editing that eliminates the need for resource-intensive per-video, per-model finetuning. At the core of our approach is a synthetic paired video dataset tailored for video-to-video transfer tasks. Inspired by Instruct Pix2Pix, which performs image transfer from editing instructions, we adapt this paradigm to the video domain. By extending Prompt-to-Prompt to videos, we efficiently generate paired samples, each consisting of an input video and its edited counterpart. We also introduce Long Video Sampling Correction, applied during sampling, which keeps long videos consistent across batches. Our method outperforms current methods such as Tune-A-Video, marking substantial progress in text-based video-to-video editing and suggesting avenues for further exploration and deployment.