Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as motion prompts. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance. Video results are available on our webpage: https://motion-prompting.github.io/
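The abstract describes the trajectory conditioning only at a high level, so the sketch below illustrates one plausible data layout for a motion prompt: a set of point trajectories with per-frame (x, y) positions and a visibility mask, which can represent a single track, dense scene motion, or temporally sparse specifications. Everything here (the `make_motion_prompt` helper, the shapes, and the field names) is an illustrative assumption, not the paper's actual interface.

```python
import numpy as np

def make_motion_prompt(num_frames: int,
                       tracks: list[np.ndarray],
                       visibility: list[np.ndarray]) -> dict:
    """Bundle N point tracks into one conditioning signal (hypothetical layout).

    tracks[i]:     (num_frames, 2) array of (x, y) positions; NaN entries mark
                   frames where the track is temporally sparse (unspecified).
    visibility[i]: (num_frames,) boolean array, True where the point is visible.
    """
    xy = np.stack(tracks)       # (N, num_frames, 2) trajectory positions
    vis = np.stack(visibility)  # (N, num_frames) per-frame visibility
    assert xy.shape == (len(tracks), num_frames, 2)
    return {"xy": xy, "visible": vis}

# Example: one sparse trajectory dragging a point to the right over 8 frames.
T = 8
track = np.stack([np.linspace(100.0, 160.0, T), np.full(T, 120.0)], axis=-1)
prompt = make_motion_prompt(T, [track], [np.ones(T, dtype=bool)])
```

Under this assumed layout, the same structure would cover the spectrum the abstract names: one or a few user-drawn tracks for sparse control, many tracks for object-specific or global scene motion, and NaN-masked frames for temporally sparse motion.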