The incorporation of world modeling into manipulation policy learning has pushed the boundary of manipulation performance. However, existing efforts model only 2D visual dynamics, which is insufficient for robust manipulation when target tasks involve prominent depth-wise movement. To address this, we present a 3D dynamics-aware manipulation framework that seamlessly integrates 3D world modeling and policy learning. Our framework introduces three self-supervised learning tasks (current depth estimation, future RGB-D prediction, and 3D flow prediction) that complement each other and endow the policy model with 3D foresight. Extensive experiments in simulation and the real world show that 3D foresight greatly boosts the performance of manipulation policies without sacrificing inference speed. Code is available at https://github.com/Stardust-hyx/3D-Foresight.