High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck: they typically process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide consistency with minimal overhead. Not only is EditCtrl 10 times more compute-efficient than state-of-the-art generative editing methods, it also improves editing quality compared to methods that rely on full attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.
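To make the local-first idea concrete, the following is a minimal PyTorch sketch of attention restricted to the masked tokens plus a small set of global summary tokens, so that cost scales with the edit size rather than the full video length. All names here (LocalEditBlock, global_ctx, the tensor shapes) are hypothetical illustrations under assumed conventions, not EditCtrl's actual architecture or API.

```python
# Sketch: attention over only the K masked tokens plus G global summary
# tokens, so per-block cost is ~O((K+G)^2) instead of ~O(T_full^2).
# All module and variable names are illustrative, not the paper's API.

import torch
import torch.nn as nn

class LocalEditBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, local_tokens: torch.Tensor,
                global_ctx: torch.Tensor) -> torch.Tensor:
        # local_tokens: (B, K, D) -- only tokens inside the inpainting mask.
        # global_ctx:   (B, G, D) -- cheap video-wide summary tokens.
        x = torch.cat([global_ctx, local_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x[:, global_ctx.shape[1]:]  # keep only the edited tokens

B, T_full, K, G, D = 2, 4096, 256, 16, 512   # K << T_full: sparse edit
video_tokens = torch.randn(B, T_full, D)
mask_idx = torch.randperm(T_full)[:K]        # indices of masked tokens
local = video_tokens[:, mask_idx]            # gather only the edit region
global_ctx = torch.randn(B, G, D)            # stand-in for a global embedder
out = LocalEditBlock(D)(local, global_ctx)
print(out.shape)                             # torch.Size([2, 256, 512])
```

Under these assumptions, the full-length video enters each block only through the G summary tokens, which is one plausible way a lightweight global context embedder could keep overhead small while preserving video-wide consistency.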