Video editing is a cornerstone of digital media, with applications spanning entertainment, education, and professional communication. However, previous methods often overlook the need to comprehensively understand both global and local contexts, leading to inaccurate and inconsistent edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemporal Video Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. First, to ensure local consistency within individual frames, we design test-time editing adaptation, which adapts a pre-trained image editing model to improve consistency between potential editing directions and the text instruction, and adapts masked latent variables for precise local control. Furthermore, to maintain global consistency over the video sequence, we introduce spatiotemporal adaptation, which recursively gathers consistent attention variables in key frames and strategically applies them across the whole sequence to realize the editing effects. Extensive experiments demonstrate that, compared to baseline methods, our VIA approach produces edits that are more faithful to the source videos, more coherent in the spatiotemporal context, and more precise in local control. More importantly, we show that VIA can achieve consistent long video editing in minutes, unlocking the potential for advanced video editing tasks over long video sequences.