In this paper, we build on two major recent developments in the field, Diffusion Policies for visuomotor manipulation and large pre-trained multimodal foundation models, to obtain a robotic skill learning system. The system acquires new skills through behavioral cloning with visuomotor diffusion policies from teleoperated demonstrations. Foundation models are used to select a skill given the user's natural-language prompt. Before a skill is executed, the foundation model performs a precondition check on an observation of the workspace. We compare the performance of different foundation models for this purpose and give a detailed experimental evaluation of the user-taught skills in simulation and in the real world. Finally, we showcase the combined system on a challenging food-serving scenario in the real world. Videos of all experimental executions, as well as of the process of teaching new skills in simulation and the real world, are available on the project's website.