Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit of frame t on the model's prediction for frame t-1. To leverage the temporal redundancy of video, we propose a new I2I diffusion forward-process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this the Residual Flow Diffusion Model (RFDM); it focuses the denoising process on the changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods on editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. More content can be found at: https://smsd75.github.io/RFDM_page/
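The residual formulation can be illustrated with a minimal sketch. The abstract does not give the exact forward process, so the following is an assumed rectified-flow-style reading: instead of interpolating between pure noise and the target frame, the path runs from a (noised) copy of the previous frame's prediction to the current target, so the velocity the model must predict is essentially the inter-frame residual. The function names, the linear schedule, and the noise placement are all illustrative assumptions, not the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def rfdm_forward(y_prev, y_target, t, noise):
    # Assumed linear path: starts at the previous prediction plus noise (t=0)
    # and ends at the current target frame (t=1). This is a sketch of a
    # residual-centred forward process, not the paper's exact formulation.
    return (1.0 - t) * (y_prev + noise) + t * y_target

def residual_velocity(y_prev, y_target, noise):
    # Velocity target along the assumed path: d/dt x_t = y_target - (y_prev + noise),
    # i.e. (up to the noise term) the residual between consecutive frames.
    return y_target - (y_prev + noise)

# Toy frames: consecutive video frames are temporally redundant,
# so the target differs from the previous prediction only slightly.
y_prev = rng.standard_normal((4, 4))
y_target = y_prev + 0.1 * rng.standard_normal((4, 4))
noise = 0.01 * rng.standard_normal((4, 4))

x_mid = rfdm_forward(y_prev, y_target, t=0.5, noise=noise)
v = residual_velocity(y_prev, y_target, noise)
# The residual velocity is much smaller in magnitude than the frame itself,
# which is the intuition behind focusing denoising on inter-frame changes.
print(np.abs(v).mean(), np.abs(y_target).mean())
```

Under this reading, the efficiency claim follows naturally: the denoiser spends its capacity on the small residual signal rather than re-synthesizing each edited frame from scratch.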