Acquiring a multi-task imitation policy for 3D manipulation poses challenges in scene understanding and action prediction. Current methods employ both 3D representations and multi-view 2D representations to predict the pose of the robot's end-effector. However, they still require a considerable number of high-quality robot trajectories, and suffer from limited generalization to unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation that leverages a vision foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM), pre-trained on a vast number of images and promptable masks, as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results on various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
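To make the two mechanisms named above more concrete, the sketch below illustrates (i) parameter-efficient fine-tuning of a frozen pre-trained vision backbone via low-rank adapters and (ii) a multi-channel heatmap head that predicts a sequence of future action steps in a single forward pass. This is a minimal PyTorch-style sketch under stated assumptions: all class names, shapes, and hyperparameters (adapter rank, horizon length, feature dimensions) are illustrative and do not reflect the released SAM-E implementation.

```python
# Illustrative sketch only: a LoRA-style adapter on a frozen layer and a
# multi-channel heatmap head producing one spatial heatmap per future step.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (LoRA-style)."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as an identity update

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))


class SequenceHeatmapHead(nn.Module):
    """Decodes visual features into H heatmap channels, one per action step."""

    def __init__(self, feat_dim: int = 256, horizon: int = 8):
        super().__init__()
        self.horizon = horizon
        self.decode = nn.Sequential(
            nn.Conv2d(feat_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, horizon, 1),       # one 2D heatmap per future step
        )

    def forward(self, feat):                  # feat: (B, feat_dim, H, W)
        logits = self.decode(feat)            # (B, horizon, H, W)
        b, t, h, w = logits.shape
        # per-step spatial softmax -> distribution over end-effector positions
        return logits.view(b, t, h * w).softmax(-1).view(b, t, h, w)


if __name__ == "__main__":
    # Stand-in for a pre-trained encoder block adapted with a LoRA-style layer.
    proj = LoRALinear(nn.Linear(256, 256), rank=8)
    tokens = proj(torch.randn(2, 16 * 16, 256))          # (B, tokens, dim)
    feat = tokens.transpose(1, 2).reshape(2, 256, 16, 16)
    heatmaps = SequenceHeatmapHead(feat_dim=256, horizon=8)(feat)
    print(heatmaps.shape)                     # torch.Size([2, 8, 16, 16])
```

Only the adapter and head parameters are trained here, which is the essence of parameter-efficient fine-tuning; the single forward pass yields all `horizon` heatmaps at once rather than one action per inference step.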