We introduce InVi, an approach for inserting or replacing objects within videos (referred to as inpainting) using off-the-shelf, text-to-image latent diffusion models. Unlike existing video editing methods that focus on comprehensive re-styling or whole-scene alterations, InVi targets controlled manipulation of objects and their seamless blending into a background video. To achieve this, we tackle two key challenges. First, for high-quality control and blending, we employ a two-step process of inpainting and matching: we insert the object into a single frame using a ControlNet-based inpainting diffusion model, then generate subsequent frames conditioned on features of that inpainted frame, which serves as an anchor that minimizes the domain gap between the background and the object. Second, to ensure temporal coherence, we replace the diffusion model's self-attention layers with extended-attention layers, in which the anchor-frame features serve as the keys and values, enhancing consistency across frames. Our approach removes the need for video-specific fine-tuning, yielding an efficient and adaptable solution. Experimental results demonstrate that InVi achieves realistic object insertion with consistent blending and coherence across frames, outperforming existing methods.
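To make the extended-attention idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: it assumes the anchor frame's features at the corresponding layer have already been cached, and it simply concatenates them with the current frame's tokens to form the key/value context, so each frame's queries can attend to the anchor. The class name `ExtendedAttention`, its constructor arguments, and the shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExtendedAttention(nn.Module):
    """Sketch of an extended-attention layer (hypothetical, not the paper's code).

    Queries come from the frame currently being denoised; keys and values are
    drawn from the concatenation of that frame's tokens and cached anchor-frame
    tokens, which pulls each frame's generation toward the anchor.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
        # x:      (batch, tokens, dim) features of the frame being denoised
        # anchor: (batch, anchor_tokens, dim) cached anchor-frame features
        b, n, d = x.shape
        kv_src = torch.cat([x, anchor], dim=1)  # extend the key/value context

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.to_q(x))
        k = split_heads(self.to_k(kv_src))
        v = split_heads(self.to_v(kv_src))

        out = F.scaled_dot_product_attention(q, k, v)  # requires torch >= 2.0
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)

# Smoke test with made-up shapes: one frame's tokens attend to anchor tokens.
attn = ExtendedAttention(dim=320)
x = torch.randn(1, 64, 320)       # current-frame tokens
anchor = torch.randn(1, 64, 320)  # cached anchor-frame tokens
y = attn(x, anchor)               # -> (1, 64, 320)
```

In a real pipeline this module would stand in for the self-attention layers of the denoising U-Net, with the anchor tokens recorded once from the inpainted anchor frame and reused for every subsequent frame.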