With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models can interpret complex tasks from language commands, but they often struggle to generalize to out-of-distribution objects because of the limitations of their low-level action primitives. In contrast, existing task-specific models excel at low-level manipulation of unknown objects, but each works for only a single type of action. To bridge this gap, we present M2T2, a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes. M2T2 is a transformer model that reasons about contact points and predicts valid gripper poses for different action modes given a raw point cloud of the scene. Trained on a large-scale synthetic dataset with 128K scenes, M2T2 achieves zero-shot sim2real transfer on a real robot, outperforming a baseline system built from state-of-the-art task-specific models by about 19% in overall performance and by 37.5% in challenging scenes where the object needs to be re-oriented for collision-free placement. M2T2 also achieves state-of-the-art results on a subset of language-conditioned tasks in RLBench. Videos of robot experiments on unseen objects in both the real world and simulation are available on our project website: https://m2-t2.github.io.