The human brain and human behavior offer a rich source of inspiration for novel control and learning methods in robotics. Drawing on how humans acquire knowledge and transfer skills among tasks, we introduce a novel multi-task reinforcement learning framework named Episodic Return Progress with Bidirectional Progressive Neural Networks (ERP-BPNN). The proposed ERP-BPNN model (1) learns in a human-like interleaved manner, (2) switches tasks autonomously based on a novel intrinsic motivation signal, and, in contrast to existing methods, (3) allows bidirectional skill transfer among tasks. ERP-BPNN is a general architecture applicable to several multi-task learning settings; in this paper, we present the details of its neural architecture and show its ability to enable effective learning and skill transfer among morphologically different robots in a reaching task. The Bidirectional Progressive Neural Network (BPNN) architecture enables bidirectional skill transfer without requiring incremental training and integrates seamlessly with online task arbitration. The task arbitration mechanism is based on soft Episodic Return Progress (ERP), a novel intrinsic motivation (IM) signal. To evaluate our method, we use quantifiable robotics metrics such as 'expected distance to goal' and 'path straightness', in addition to episodic return, the reward-based measure commonly used in reinforcement learning. In simulation experiments with morphologically different robots, ERP-BPNN achieves faster cumulative convergence than the baselines and improves performance on all metrics considered.
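To make the idea of soft, progress-based task arbitration concrete, the following is a minimal illustrative sketch and not the formulation used in the paper. It assumes that progress on a task can be approximated as the difference between the mean episodic return over a recent window and the mean over the preceding window, and that the next task is sampled from a softmax over these progress values. The function names, the `window` and `temperature` parameters, and the progress definition itself are all hypothetical.

```python
import math
import random


def episodic_return_progress(returns, window=5):
    """Hypothetical progress measure (an assumption, not the paper's definition):
    mean of the last `window` returns minus the mean of the `window` before it.
    Returns 0.0 until at least 2 * window episodes have been recorded."""
    if len(returns) < 2 * window:
        return 0.0
    recent = returns[-window:]
    earlier = returns[-2 * window:-window]
    return sum(recent) / window - sum(earlier) / window


def soft_task_probabilities(progress, temperature=1.0):
    """Softmax over per-task progress values, so that tasks with higher
    progress are chosen more often but no task is ever excluded."""
    scores = [p / temperature for p in progress]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_task(progress, temperature=1.0, rng=random):
    """Sample the index of the next task to train on."""
    probs = soft_task_probabilities(progress, temperature)
    return rng.choices(range(len(progress)), weights=probs, k=1)[0]
```

In an interleaved training loop under these assumptions, one would append each finished episode's return to the history of the task that was trained, recompute the progress values for all tasks, and call `select_task` to decide which task to train on next. A lower `temperature` makes the choice greedier, while a higher one moves it toward uniform sampling.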