This paper describes a deep reinforcement learning (DRL) approach that won Phase 1 of the Real Robot Challenge (RRC) 2021, and then extends this method to a more difficult manipulation task. The RRC task was to use a TriFinger robot to move a cube along a specified positional trajectory, with no requirement on the cube's orientation. We used a relatively simple reward function, combining a goal-based sparse reward with a distance reward, together with Hindsight Experience Replay (HER) to guide the learning of a Deep Deterministic Policy Gradient (DDPG) agent. This approach allowed our agents to acquire dexterous robotic manipulation strategies in simulation. These strategies were then applied to the real robot and outperformed all other competition submissions, including those using more traditional robotic control techniques, in the final evaluation stage of the RRC. Here we extend the method by modifying the Phase 1 task so that the robot must also hold the cube in a particular orientation while moving it along the required positional trajectory. The added orientation requirement increases the problem complexity to the point where the agent cannot learn the task through blind exploration. To circumvent this issue, we make novel use of a Knowledge Transfer (KT) technique that transfers the strategies learned by the agent in the original task (which was agnostic to cube orientation) to the extended task (where orientation matters). KT allowed the agent to learn and perform the extended task in the simulator, reducing the average positional deviation from 0.134 m to 0.02 m and the average orientation deviation from 142° to 76° during evaluation. This KT concept shows good generalisation properties and could be applied to any actor-critic learning algorithm.
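As a rough illustration of how a goal-based sparse reward, a distance reward, and hindsight relabelling can fit together, the following minimal Python sketch may help. It is not the implementation used in this work: the function names, the success threshold of 0.01 m, the distance weight, and the choice of the "final" relabelling strategy are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def combined_reward(
    achieved: Vec,
    goal: Vec,
    threshold: float = 0.01,       # hypothetical success radius in metres
    distance_weight: float = 1.0,  # hypothetical weight on the distance term
) -> float:
    """Sparse goal-based term plus a dense distance penalty (assumed form)."""
    dist = math.dist(achieved, goal)
    # Sparse term: 0 when the cube is within the threshold of the goal, else -1.
    sparse = 0.0 if dist <= threshold else -1.0
    # Distance term: penalise the Euclidean distance to the goal.
    return sparse - distance_weight * dist


def relabel_with_final_goal(
    episode: List[Tuple[Vec, Vec]],
) -> List[Tuple[Vec, Vec, float]]:
    """HER-style relabelling using the 'final' strategy (an assumption here).

    Each input step is (achieved_position, original_goal). The original goal
    is replaced by the position actually reached at the end of the episode,
    and the reward is recomputed against that substituted goal.
    """
    final_achieved = episode[-1][0]
    return [
        (achieved, final_achieved, combined_reward(achieved, final_achieved))
        for achieved, _ in episode
    ]


if __name__ == "__main__":
    # A toy episode that never reaches its original goal at (1, 1, 1).
    toy_episode = [
        ((0.00, 0.0, 0.0), (1.0, 1.0, 1.0)),
        ((0.05, 0.0, 0.0), (1.0, 1.0, 1.0)),
        ((0.10, 0.0, 0.0), (1.0, 1.0, 1.0)),
    ]
    for achieved, new_goal, reward in relabel_with_final_goal(toy_episode):
        print(achieved, new_goal, round(reward, 3))
```

The point of the relabelling step is that an episode which failed with respect to its original goal still yields transitions with a non-failing sparse term, which is what makes sparse goal-based rewards learnable for an off-policy agent such as DDPG.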