This paper presents a video inversion approach for zero-shot video editing that models the input video with a low-rank representation during the inversion process. Existing video editing methods usually apply typical 2D DDIM inversion or naive spatial-temporal DDIM inversion before editing, both of which rely on a time-varying representation for each frame to derive the noisy latent. Unlike most existing approaches, we propose Spatial-Temporal Expectation-Maximization (STEM) inversion, which formulates the dense video feature in an expectation-maximization manner and iteratively estimates a more compact basis set to represent the whole video. Each frame then uses this fixed, global representation for inversion, which better supports temporal consistency during reconstruction and editing. Extensive qualitative and quantitative experiments demonstrate that STEM inversion achieves consistent improvements on two state-of-the-art video editing methods. Project page: https://stem-inv.github.io/page/.
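To make the idea of iteratively estimating a compact basis set concrete, the following is a minimal NumPy sketch of one generic way such an estimate could be computed. It is an illustration under stated assumptions, not the STEM implementation: the softmax responsibilities, the L2 normalization of the bases, and the names `em_low_rank_bases`, `num_bases`, `num_iters`, and `temperature` are all hypothetical choices made here. The dense video feature is assumed to be a `(T, H, W, C)` array flattened to `(T*H*W, C)`, so the same bases are shared by every frame.

```python
import numpy as np


def _responsibilities(features, bases, temperature):
    """E-step: soft-assign each feature vector to the bases via a softmax."""
    logits = features @ bases.T / temperature          # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(logits)
    return resp / resp.sum(axis=1, keepdims=True)


def em_low_rank_bases(features, num_bases=8, num_iters=5, temperature=1.0, seed=0):
    """Estimate a compact basis set for `features` of shape (N, C).

    Hypothetical sketch: alternates a softmax E-step with a
    responsibility-weighted-mean M-step. Returns (bases, resp) with
    shapes (K, C) and (N, K).
    """
    rng = np.random.default_rng(seed)
    _, channels = features.shape
    bases = rng.standard_normal((num_bases, channels))
    bases /= np.linalg.norm(bases, axis=1, keepdims=True) + 1e-8

    for _ in range(num_iters):
        resp = _responsibilities(features, bases, temperature)
        # M-step: each basis becomes the weighted mean of the features.
        bases = (resp.T @ features) / (resp.sum(axis=0)[:, None] + 1e-8)
        bases /= np.linalg.norm(bases, axis=1, keepdims=True) + 1e-8

    # Final E-step so the returned responsibilities match the returned bases.
    resp = _responsibilities(features, bases, temperature)
    return bases, resp


# Toy stand-in for a dense video feature: 4 frames, 8x8 spatial grid, 16 channels.
video_feats = np.random.default_rng(1).standard_normal((4, 8, 8, 16))
flat = video_feats.reshape(-1, video_feats.shape[-1])   # (T*H*W, C)

bases, resp = em_low_rank_bases(flat, num_bases=8)
# Low-rank reconstruction: every position in every frame is a mixture
# of the same K global bases, so the result has rank at most K.
recon = (resp @ bases).reshape(video_feats.shape)
```

Because the flattened array spans all frames, the estimated bases are global to the video rather than specific to any frame, which is the property the abstract associates with temporal consistency. How these bases are then used inside DDIM inversion is not specified in the text above and is left out of this sketch.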