GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation

The ability of robots to follow language instructions and execute diverse 3D tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to task variability. Recent approaches mitigate this issue by leveraging large foundation models to help understand novel tasks. However, these methods lack a task-specific learning process, which is essential for an accurate understanding of 3D environments, and this often leads to execution failures. In this paper, we introduce GravMAD, a sub-goal-driven, language-conditioned action diffusion framework that combines the strengths of imitation learning and foundation models. Our approach breaks tasks into sub-goals based on language instructions, which provides auxiliary guidance during both training and inference. During training, we introduce Sub-goal Keypose Discovery to identify key sub-goals from demonstrations. At inference, where no demonstrations are available, we instead use pre-trained foundation models to bridge the gap and identify sub-goals for the current task. In both phases, GravMaps are generated from the sub-goals, providing more flexible 3D spatial guidance than fixed 3D positions. Empirical evaluations on RLBench show that GravMAD significantly outperforms state-of-the-art methods, with a 28.63% improvement on novel tasks and a 13.36% gain on tasks encountered during training. These results demonstrate GravMAD's strong multi-task learning and generalization in 3D manipulation. Video demonstrations are available at: https://gravmad.github.io.
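
To make the contrast between a spatial value map and a fixed 3D position concrete, the following is a minimal sketch and not GravMAD's actual GravMap construction, which the abstract does not specify. It assumes an axis-aligned workspace box divided into voxels and assigns each voxel a cost that grows with its distance from a single sub-goal position. The function name `subgoal_value_map` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def subgoal_value_map(subgoal, bounds_min, bounds_max, resolution=8):
    """Map each voxel index (i, j, k) to a cost in [0, 1] that grows with the
    Euclidean distance between the voxel center and the sub-goal position.

    Illustrative assumption only: this is not the GravMap defined by GravMAD.
    """
    # Edge length of a voxel along each axis of the workspace box.
    step = [(hi - lo) / resolution for lo, hi in zip(bounds_min, bounds_max)]
    dist = {}
    for idx in product(range(resolution), repeat=3):
        # Center of voxel idx in workspace coordinates.
        center = [lo + (i + 0.5) * s for lo, i, s in zip(bounds_min, idx, step)]
        dist[idx] = math.dist(center, subgoal)
    max_dist = max(dist.values())
    # Normalize so the farthest voxel has cost 1; guard the degenerate case.
    return {idx: (d / max_dist if max_dist > 0 else 0.0) for idx, d in dist.items()}
```

Under these assumptions, the lowest-cost voxel is the one whose center lies closest to the sub-goal, while nearby voxels still carry low cost. A downstream policy could therefore read the map as a soft spatial preference over a region rather than as a single target point.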
