Learning algorithms such as Quality-Diversity (QD) can be used to acquire repertoires of diverse robotics skills. This learning is commonly done in computer simulation because of the large number of evaluations required. However, training in a virtual environment introduces a gap between simulation and reality. Here, we build upon the Reset-Free QD (RF-QD) algorithm to learn controllers directly on a physical robot. This method uses a dynamics model, learned from interactions between the robot and the environment, to predict the robot's behaviour and improve sample efficiency. A behaviour selection policy filters out policies that the model predicts to be uninteresting or unsafe. RF-QD also includes a recovery policy that returns the robot to a safe zone whenever it walks outside it, allowing continuous learning. We demonstrate that our method enables a physical quadruped robot to learn a repertoire of behaviours in two hours without human supervision. We successfully test the resulting repertoire on a maze navigation task. Finally, we compare our approach to the MAP-Elites algorithm and show that both dynamics awareness and a recovery policy are required to generate an optimal archive when training on a physical robot. Video available at https://youtu.be/BgGNvIsRh7Q
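To make the loop described in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a 2-D point "robot" whose displacement is a hidden scalar gain times the controller parameters plus noise. The names `GainModel`, `execute_on_robot`, `SAFE_RADIUS`, `CELL`, `TRUE_GAIN` and the low-effort fitness are all hypothetical choices made for illustration. The sketch only shows how a learned dynamics model, a behaviour selection filter, and a recovery policy could fit together in a reset-free archive-filling loop.

```python
import math
import random

SAFE_RADIUS = 1.0  # assumed radius of the safe zone, in toy units
CELL = 0.1         # assumed resolution of the behaviour archive
TRUE_GAIN = 0.3    # hidden toy "physics": displacement = gain * params + noise


def execute_on_robot(params, rng):
    """Stand-in for one physical trial; returns the observed displacement."""
    return tuple(TRUE_GAIN * p + rng.gauss(0.0, 0.01) for p in params)


class GainModel:
    """Toy dynamics model: running least-squares estimate of one scalar gain."""

    def __init__(self):
        self.sxx = 0.0
        self.sxy = 0.0
        self.gain = 1.0  # deliberately wrong initial guess

    def predict(self, params):
        return tuple(self.gain * p for p in params)

    def update(self, params, displacement):
        for p, d in zip(params, displacement):
            self.sxx += p * p
            self.sxy += p * d
        if self.sxx > 0.0:
            self.gain = self.sxy / self.sxx


def cell_of(displacement):
    """Discretise a displacement into an archive cell."""
    return tuple(math.floor(d / CELL) for d in displacement)


def clamp(x):
    """Keep a controller parameter inside the assumed range [-1, 1]."""
    return max(-1.0, min(1.0, x))


def run(iterations=200, seed=0):
    rng = random.Random(seed)
    model, archive, pos, recoveries = GainModel(), {}, (0.0, 0.0), 0
    for _ in range(iterations):
        if math.hypot(*pos) > SAFE_RADIUS:
            # Recovery policy: command the move the model predicts leads home.
            gain = max(model.gain, 1e-3)
            params = tuple(clamp(-p / gain) for p in pos)
            recoveries += 1
        else:
            # Behaviour selection: keep candidates predicted safe and novel.
            candidates = [(rng.uniform(-1, 1), rng.uniform(-1, 1))
                          for _ in range(20)]
            kept = []
            for c in candidates:
                pred = model.predict(c)
                end = (pos[0] + pred[0], pos[1] + pred[1])
                if math.hypot(*end) <= SAFE_RADIUS and cell_of(pred) not in archive:
                    kept.append(c)
            # Fall back to the first candidate if the filter rejects them all.
            params = kept[0] if kept else candidates[0]
        disp = execute_on_robot(params, rng)  # the only "real" evaluation
        model.update(params, disp)            # learn from the interaction
        pos = (pos[0] + disp[0], pos[1] + disp[1])  # no reset between trials
        fitness = -math.hypot(*params)        # assumed fitness: prefer low effort
        cell = cell_of(disp)
        if cell not in archive or archive[cell][0] < fitness:
            archive[cell] = (fitness, params)
    return archive, model, pos, recoveries


if __name__ == "__main__":
    archive, model, pos, recoveries = run()
    print(len(archive), round(model.gain, 3), recoveries)
```

In this sketch every trial starts from wherever the previous one ended, which is what makes the recovery branch necessary. The model is queried many times per physical trial, so most candidate controllers are rejected without ever being executed.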