Video editing stands as a cornerstone of digital media, from entertainment and education to professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistent edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemporal VIdeo Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. First, to ensure local consistency within individual frames, the foundation of VIA is a novel test-time editing adaptation method that adapts a pre-trained image editing model to improve consistency between potential editing directions and the text instruction, and adapts masked latent variables for precise local control. Furthermore, to maintain global consistency over the video sequence, we introduce spatiotemporal adaptation that adapts consistent attention variables in key frames and strategically applies them across the whole sequence to realize the editing effects. Extensive experiments demonstrate that, compared to baseline methods, our VIA approach produces edits that are more faithful to the source videos, more coherent in the spatiotemporal context, and more precise in local control. More importantly, we show that VIA can achieve consistent long-video editing in minutes, unlocking the potential for advanced editing tasks over long video sequences.