Following their success in natural language processing and computer vision, foundation models pre-trained on large-scale multi-task datasets have also shown great potential in robotics. However, most existing robot foundation models rely solely on 2D image observations and ignore 3D geometric information, which is essential for robots to perceive and reason about the 3D world. In this paper, we introduce FP3, the first large-scale 3D foundation policy model for robotic manipulation. FP3 builds on a scalable diffusion transformer architecture and is pre-trained on 60k trajectories with point cloud observations. Owing to this model design and the diverse pre-training data, FP3 can be fine-tuned efficiently for downstream tasks while exhibiting strong generalization capabilities. Experiments on real robots demonstrate that with only 80 demonstrations, FP3 can learn a new task with success rates above 90% in novel environments with unseen objects, significantly surpassing existing robot foundation models.
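The abstract describes the high-level recipe — encode a point cloud observation, then condition an iterative diffusion denoiser that outputs an action chunk — without architectural details. The sketch below is only an illustration of that pattern, not FP3 itself: the encoder, the denoising rule, the horizon of 16, and the 7-dimensional action space are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_point_cloud(points, w):
    # Hypothetical per-point MLP followed by max-pooling, a common
    # pattern for turning an (N, 3) point cloud into one global feature.
    feats = np.tanh(points @ w)      # (N, d) per-point features
    return feats.max(axis=0)         # (d,) global feature vector

def denoise_actions(noisy_actions, cond, w_cond, steps=10):
    # Toy stand-in for reverse diffusion: each step moves the noisy
    # action chunk toward a conditioning-dependent target (in a real
    # diffusion policy, a learned transformer predicts this update).
    target = (cond @ w_cond).reshape(noisy_actions.shape)
    a = noisy_actions.copy()
    for _ in range(steps):
        a = a + 0.2 * (target - a)   # simple contraction per step
    return a

point_cloud = rng.normal(size=(512, 3))   # 512 points, xyz coordinates
w_enc = rng.normal(size=(3, 64))          # assumed encoder weights
w_cond = rng.normal(size=(64, 16 * 7))    # horizon=16, action_dim=7 (assumed)

cond = encode_point_cloud(point_cloud, w_enc)
actions = denoise_actions(rng.normal(size=(16, 7)), cond, w_cond)
print(actions.shape)
```

The key design point the abstract emphasizes is the conditioning input: replacing a 2D image encoder with a point-cloud encoder gives the policy explicit 3D geometry, which is what the paper credits for generalization to novel environments and unseen objects.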