Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework that learns object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation from 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on a range of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at https://gs-dynamics.github.io.
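To make the interpolation step concrete: once the dynamics model predicts motions for the sparse control particles, each dense Gaussian can inherit a motion interpolated from its nearby control particles. The sketch below is a minimal, hypothetical illustration of this idea using inverse-distance weighting over the k nearest control particles; the function name and weighting scheme are assumptions for exposition, not the paper's exact interpolation method.

```python
import numpy as np

def interpolate_gaussian_motion(gaussians, particles_t0, particles_t1, k=4):
    """Propagate predicted control-particle motion to dense Gaussian centers.

    gaussians:       (N, 3) Gaussian centers at time t0
    particles_t0/t1: (M, 3) control particle positions before/after prediction
    Uses inverse-distance weighting over the k nearest control particles
    (a simplified stand-in for the interpolation described in the abstract).
    """
    # Pairwise distances from each Gaussian center to each control particle.
    d = np.linalg.norm(gaussians[:, None, :] - particles_t0[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]                # k nearest particles
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + 1e-8)
    w /= w.sum(axis=1, keepdims=True)                 # normalize weights
    motion = particles_t1 - particles_t0              # (M, 3) predicted displacements
    # Weighted blend of the neighbors' displacements, applied to each Gaussian.
    return gaussians + np.einsum('nk,nkd->nd', w, motion[idx])
```

In practice such a blend would also transform each Gaussian's rotation and scale, but translating the centers is enough to convey the mechanism: sparse predicted motions are densified into per-Gaussian transformations that can then be rendered.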