Although the typical inversion-then-editing paradigm built on text-to-image (T2I) models has demonstrated promising results, directly extending it to text-to-video (T2V) models still suffers from severe artifacts such as color flickering and content distortion. Consequently, current video editing methods primarily rely on T2I models, which inherently lack temporal-coherence generation capability and therefore often yield inferior editing results. In this paper, we attribute the failure of the typical editing paradigm to two factors: 1) Tight Spatial-Temporal Coupling: the vanilla pivotal inversion strategy struggles to disentangle spatial-temporal information in the video diffusion model; and 2) Complicated Spatial-Temporal Layout: vanilla cross-attention control is deficient in preserving the unedited content. To address these limitations, we propose spatial-temporal decoupled guidance (STDG) and a multi-frame null-text optimization strategy that provide pivotal temporal cues for more precise pivotal inversion. Furthermore, we introduce a self-attention control strategy to maintain higher fidelity during precise partial content editing. Experimental results demonstrate that our method (termed VideoDirector) effectively harnesses the powerful temporal generation capability of T2V models, producing edited videos that achieve state-of-the-art accuracy, motion smoothness, realism, and fidelity to unedited content.
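To make the inversion side concrete, the sketch below shows what a multi-frame null-text optimization loop could look like: following the null-text inversion idea, the unconditional (null) embedding is tuned at each denoising timestep so the classifier-free-guided path tracks the recorded DDIM-inversion trajectory, here extended to one learnable null embedding per frame. All names (`unet`, `scheduler_step`, `inv_latents`) are illustrative assumptions, not the authors' actual API, and the STDG guidance term is omitted for brevity.

```python
import torch

@torch.enable_grad()
def multi_frame_null_text_opt(unet, scheduler_step, inv_latents, timesteps,
                              cond_emb, null_init, cfg_scale=7.5,
                              inner_steps=10, lr=1e-2):
    """Hypothetical sketch: per timestep, optimize one null embedding per
    frame so the guided denoising step lands on the inversion latent.

    unet(x, t, emb)          -> predicted noise, shapes assumed (F, C, H, W)
    scheduler_step(eps, t, x)-> previous (less noisy) latent, DDIM-style
    inv_latents[t]           -> latent recorded at timestep t during inversion
    timesteps                -> descending DDIM timesteps
    """
    x_t = inv_latents[timesteps[0]]                    # pivot, (F, C, H, W)
    num_frames = x_t.shape[0]
    # independent null embedding per frame: the multi-frame extension
    null_embs = null_init.expand(num_frames, -1, -1).clone().requires_grad_(True)
    cond = cond_emb.expand(num_frames, -1, -1)
    optimized = {}
    for i, t in enumerate(timesteps[:-1]):
        target = inv_latents[timesteps[i + 1]]         # next inversion latent
        opt = torch.optim.Adam([null_embs], lr=lr)
        for _ in range(inner_steps):
            eps_cond = unet(x_t, t, cond)
            eps_null = unet(x_t, t, null_embs)
            eps = eps_null + cfg_scale * (eps_cond - eps_null)   # CFG
            loss = torch.nn.functional.mse_loss(scheduler_step(eps, t, x_t),
                                                target)
            opt.zero_grad(); loss.backward(); opt.step()
        optimized[int(t)] = null_embs.detach().clone()
        with torch.no_grad():                          # advance the pivot path
            eps_cond = unet(x_t, t, cond)
            eps_null = unet(x_t, t, null_embs)
            eps = eps_null + cfg_scale * (eps_cond - eps_null)
            x_t = scheduler_step(eps, t, x_t)
    return optimized                                   # timestep -> embeddings
```

The per-frame embeddings give the optimizer temporal degrees of freedom that a single shared null embedding lacks, which is what supplies the "pivotal temporal cues" mentioned above.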
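Likewise, self-attention control of the kind described above is commonly implemented by caching self-attention keys and values during a reconstruction pass and injecting them during the editing pass, so queries from the edited branch attend to the original content's features. The sketch below is a generic cache-and-inject wrapper under that assumption, not the authors' exact mechanism; the layer names and hook plumbing are hypothetical.

```python
import torch

class SelfAttnKVInjector:
    """Record self-attention keys/values on the source (reconstruction)
    pass; inject them on the editing pass to preserve unedited content."""

    def __init__(self):
        self.cache = {}        # layer_name -> (keys, values)
        self.mode = "record"   # "record" during reconstruction, "inject" during editing

    def __call__(self, layer_name: str, k: torch.Tensor, v: torch.Tensor):
        if self.mode == "record":
            self.cache[layer_name] = (k.detach(), v.detach())
            return k, v
        # inject: reuse the source branch's keys/values for this layer
        return self.cache[layer_name]

# Assumed usage: wrap each self-attention layer so its (k, v) pass through
# the injector; run the source prompt with mode="record", then the edited
# prompt with mode="inject". A spatial mask could restrict injection to
# unedited regions for partial content editing.
```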