We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained vision-language model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.
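To make the idea of "manipulating features within the attention mechanism" concrete, below is a minimal, hypothetical sketch of the general technique class the abstract refers to: extending the attention's keys/values with features from the original video so newly generated content attends to the existing scene, plus a latent blend that preserves the original footage outside the edited region. All function names, tensor shapes, and the mask are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def extended_attention(q_edit, k_edit, v_edit, k_orig, v_orig):
    """Schematic extended attention: queries from the edited video's
    denoising pass attend to both its own keys/values and keys/values
    extracted from the original video, so new content stays consistent
    with the existing scene. Shapes: (batch, heads, tokens, dim).
    (Illustrative only; the paper's exact mechanism may differ.)"""
    k = torch.cat([k_edit, k_orig], dim=2)  # extend the key set
    v = torch.cat([v_edit, v_orig], dim=2)  # extend the value set
    return F.scaled_dot_product_attention(q_edit, k, v)


def masked_latent_blend(z_edit, z_orig, mask):
    """Schematic latent blending: keep the original-scene latents outside a
    (hypothetical) mask localizing the new content, and the edited latents
    inside it. mask: same spatial shape as the latents, values in [0, 1]."""
    return mask * z_edit + (1.0 - mask) * z_orig


if __name__ == "__main__":
    # Toy shapes just to show the calls run end to end.
    b, h, t, d = 1, 8, 1024, 64
    q = torch.randn(b, h, t, d)
    k_e, v_e = torch.randn(b, h, t, d), torch.randn(b, h, t, d)
    k_o, v_o = torch.randn(b, h, t, d), torch.randn(b, h, t, d)
    out = extended_attention(q, k_e, v_e, k_o, v_o)
    print(out.shape)  # torch.Size([1, 8, 1024, 64])
```

In this sketch, concatenating the original video's keys/values anchors the generated content to the scene's appearance and motion, while the mask-based blend limits where the edit can alter the original frames; both are stand-ins for whatever localization and integration the method actually performs at inference time.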