We present a new method for text-driven motion transfer: synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining the motion and scene layout of an input video. Prior methods are confined to transferring motion between two subjects within the same or closely related object categories and apply only to limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The core of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while conforming to the target object's shape and fine-grained motion traits.