This paper presents Video-P2P, a novel framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models, no large-scale video generation models are currently publicly available. Video-P2P addresses this limitation by adapting an image generation diffusion model to various video editing tasks. Specifically, we first tune a Text-to-Set (T2S) model to obtain an approximate inversion, and then optimize a shared unconditional embedding to achieve accurate video inversion at a small memory cost. For attention control, we introduce a novel decoupled-guidance strategy, which applies different guidance strategies to the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability, while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing. These technical designs enable various text-driven editing applications, including word swap, prompt refinement, and attention re-weighting. Video-P2P works well on real-world videos, generating new characters while preserving the original poses and scenes as faithfully as possible. It significantly outperforms previous approaches.
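The decoupled-guidance idea can be illustrated with a minimal sketch. It assumes that guidance takes the standard classifier-free form, which the abstract does not state explicitly. The noise predictor `toy_denoiser`, the function names, and the guidance scale of 7.5 are hypothetical placeholders rather than parts of Video-P2P. The attention-map exchange between the two branches is omitted.

```python
import numpy as np


def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one by a factor of `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def toy_denoiser(latent, embedding):
    """Hypothetical stand-in for a diffusion noise predictor.
    A real model would be a text-conditioned U-Net; this is only a toy."""
    return 0.1 * latent + embedding.mean()


def decoupled_guidance_step(latent_src, latent_tgt, src_emb, tgt_emb,
                            optimized_null, initial_null, scale=7.5):
    """One denoising step with a different unconditional embedding per branch.

    Source branch: uses the optimized unconditional embedding (reconstruction).
    Target branch: uses the initial unconditional embedding (editability).
    """
    eps_src = cfg(toy_denoiser(latent_src, optimized_null),
                  toy_denoiser(latent_src, src_emb), scale)
    eps_tgt = cfg(toy_denoiser(latent_tgt, initial_null),
                  toy_denoiser(latent_tgt, tgt_emb), scale)
    return eps_src, eps_tgt


if __name__ == "__main__":
    latent = np.zeros((2, 4))             # toy latent shared by both branches
    src_emb = np.full(8, 1.0)             # toy source-prompt embedding
    tgt_emb = np.full(8, 1.0)             # toy target-prompt embedding
    optimized_null = np.full(8, 0.5)      # assumed result of inversion tuning
    initial_null = np.zeros(8)            # assumed empty-prompt embedding
    eps_src, eps_tgt = decoupled_guidance_step(
        latent, latent, src_emb, tgt_emb, optimized_null, initial_null)
    print(eps_src.shape, eps_tgt.shape)
```

Even with identical latents and prompt embeddings, the two branches yield different noise predictions in this sketch, because only the unconditional embedding differs between them.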