Generating motion-controlled videos, in which user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints, demands two capabilities: (1) disentangled motion control, which lets users control object motion and camera viewpoint separately; and (2) motion causality, which ensures that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal, and they treat motion as kinematic displacement without modeling causal relationships between object motions. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components and train the model to learn motion causality from data. At inference, users can either supply active motion and let MoRight predict the consequences (forward reasoning), or specify desired passive outcomes and let MoRight recover plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.