Enabling robots to comprehend and execute manipulation tasks from natural language instructions is a long-term goal in robotics. The dominant approaches to language-guided manipulation use 2D image representations, which have difficulty combining multi-view cameras and inferring precise 3D positions and relationships. To address these limitations, we propose PolarNet, a 3D point-cloud-based policy for language-guided manipulation. It leverages carefully designed point cloud inputs, efficient point cloud encoders, and multimodal transformers to learn 3D point cloud representations and integrate them with language instructions for action prediction. In a variety of experiments on the RLBench benchmark, PolarNet is shown to be effective and data efficient. It outperforms state-of-the-art 2D and 3D approaches in both single-task and multi-task learning, and it also achieves promising results on a real robot.
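To make the multi-view point concrete, the following is a minimal sketch of how depth images from several cameras can be fused into a single world-frame point cloud, which is the kind of unified 3D input a point cloud policy consumes. It is not PolarNet's actual preprocessing: the function names `backproject_depth` and `merge_views`, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the 4x4 `cam_to_world` matrices are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Lift a depth image to 3D points in the world frame (hypothetical helper)."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # treat non-positive depth as an invalid pixel
                continue
            # Pinhole model: pixel (u, v) at depth z -> camera-frame point.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Apply the first three rows of the 4x4 camera-to-world transform.
            wx, wy, wz = (
                r[0] * x + r[1] * y + r[2] * z + r[3] for r in cam_to_world[:3]
            )
            points.append((wx, wy, wz))
    return points


def merge_views(views: Sequence[Dict]) -> List[Point]:
    """Concatenate the back-projected points of every camera into one cloud."""
    cloud: List[Point] = []
    for view in views:
        cloud.extend(backproject_depth(**view))
    return cloud


if __name__ == "__main__":
    identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    shifted = [[1, 0, 0, 10], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    depth = [[1.0, 0.0], [2.0, 1.0]]  # toy 2x2 depth map, one invalid pixel
    intr = dict(fx=1.0, fy=1.0, cx=0.0, cy=0.0)
    cloud = merge_views([
        dict(depth=depth, cam_to_world=identity, **intr),
        dict(depth=depth, cam_to_world=shifted, **intr),
    ])
    print(len(cloud), "points in the merged cloud")
```

Because every camera's points end up in the same coordinate frame, the number of views does not change the input format, whereas a 2D policy has to process each image separately and infer 3D relationships implicitly.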