The rapid progress of text-to-video (T2V) generation has renewed interest in controllable video editing. Although editing models built on pre-trained T2V backbones achieve efficient editing, current works are still hindered by two major challenges. First, the inherent limitations of T2V models lead to content inconsistencies and motion discontinuities between frames. Second, the notorious problem of over-editing disrupts regions that should remain unaltered. To address these challenges, we explore a robust video editing paradigm based on score distillation. Specifically, we propose an Adaptive Sliding Score Distillation strategy that both stabilizes T2V supervision and incorporates global and local video guidance to mitigate the impact of generation errors. In addition, we modify the self-attention layers during editing to better preserve the key features of the original video. Extensive experiments demonstrate that these strategies effectively resolve the aforementioned challenges and achieve superior editing performance compared with existing state-of-the-art methods.
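To make the sliding-window idea concrete, the following is a minimal sketch of score distillation applied over overlapping temporal windows of a video's latents. It assumes a diffusers-style scheduler with an `add_noise` method; the names `t2v_unet` (and its call signature), `prompt_emb`, `window`, and `stride` are hypothetical stand-ins, and the paper's adaptive weighting and global/local guidance terms are omitted.

```python
# Sketch: SDS-style gradient accumulated over sliding temporal windows.
# `t2v_unet`, its call signature, `prompt_emb`, `window`, and `stride`
# are hypothetical illustrations, not the paper's actual implementation.
import torch

def sliding_sds_grad(latents, t2v_unet, scheduler, prompt_emb,
                     window=8, stride=4):
    """Accumulate a score-distillation gradient for video latents (F, C, H, W)."""
    num_frames = latents.shape[0]
    grad = torch.zeros_like(latents)
    hits = torch.zeros(num_frames, device=latents.device)

    for start in range(0, max(num_frames - window, 0) + 1, stride):
        clip = latents[start:start + window]                      # local window
        t = torch.randint(20, 980, (1,), device=latents.device)   # random timestep
        noise = torch.randn_like(clip)
        noisy = scheduler.add_noise(clip, noise, t)                # forward diffusion
        with torch.no_grad():
            # Hypothetical T2V UNet call: (batch, frames, C, H, W) -> noise prediction.
            eps_pred = t2v_unet(noisy.unsqueeze(0), t, prompt_emb).squeeze(0)
        # Classic SDS gradient: predicted noise minus the injected noise.
        grad[start:start + window] += eps_pred - noise
        hits[start:start + window] += 1

    # Average overlapping windows so each frame receives comparable supervision.
    return grad / hits.clamp(min=1).view(-1, 1, 1, 1)
```

The uniform average over overlapping windows is one plausible way to combine local supervision signals; the adaptive weighting and the global/local guidance described in the abstract would replace this simple average.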