Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and do not generalize to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions under interventions, but they focus on generating a single future state, neglecting the continuous motion and subsequent dynamics that result from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines.