Expert-Free Online Transfer Learning in Multi-Agent Reinforcement Learning

Transfer learning in Reinforcement Learning (RL) has been widely studied as a way to overcome the training issues of Deep RL, namely exploration cost, data availability and convergence time, by enhancing the training phase with external knowledge. Generally, knowledge is transferred from expert agents to novices. While this addresses the issue for a novice agent, the expert agent must already have a good understanding of the task for such transfer to be effective. As an alternative, in this paper we propose Expert-Free Online Transfer Learning (EF-OnTL), an algorithm that enables expert-free, real-time, dynamic transfer learning in multi-agent systems. No dedicated expert exists; the transfer source agent and the knowledge to be transferred are selected dynamically at each transfer step based on the agents' performance and uncertainty. To improve uncertainty estimation, we also propose State Action Reward Next-State Random Network Distillation (sars-RND), an extension of RND that estimates uncertainty from the RL agent-environment interaction. We demonstrate the effectiveness of EF-OnTL against a no-transfer scenario and advice-based baselines, with and without expert agents, in three benchmark tasks: Cart-Pole, a grid-based Multi-Team Predator-Prey (mt-pp) and Half Field Offense (HFO). Our results show that EF-OnTL achieves overall performance comparable to that of advice-based baselines while requiring neither external input nor threshold tuning. EF-OnTL outperforms the no-transfer scenario, with an improvement that depends on the complexity of the task addressed.
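
To make the uncertainty idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual RND setup (a fixed random target network and a predictor trained to imitate it, with prediction error read as uncertainty) applied to the concatenated (state, action, reward, next-state) tuple. The class `SarsRND`, the helper `tuples_to_transfer`, the single linear layers, the learning rate and the "largest uncertainty gap" selection rule are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
import random


class SarsRND:
    """Toy RND-style uncertainty estimator over (s, a, r, s') tuples (illustrative only)."""

    def __init__(self, input_dim, embed_dim=8, lr=0.05, seed=0):
        rng = random.Random(seed)
        # Fixed, randomly initialised target network; it is never trained.
        self.target = [[rng.uniform(-1, 1) for _ in range(input_dim)]
                       for _ in range(embed_dim)]
        # Predictor network, trained to imitate the target on visited tuples.
        self.predictor = [[0.0] * input_dim for _ in range(embed_dim)]
        self.lr = lr

    @staticmethod
    def encode(state, action, reward, next_state):
        # Concatenate the whole interaction tuple into one input vector.
        return list(state) + [float(action), float(reward)] + list(next_state)

    def _errors(self, x):
        errs = []
        for t_row, p_row in zip(self.target, self.predictor):
            t_out = math.tanh(sum(w * v for w, v in zip(t_row, x)))
            p_out = sum(w * v for w, v in zip(p_row, x))
            errs.append(p_out - t_out)
        return errs

    def uncertainty(self, x):
        # Mean squared prediction error: high for unfamiliar tuples.
        errs = self._errors(x)
        return sum(e * e for e in errs) / len(errs)

    def update(self, x):
        # One gradient step on the squared error (constant factors folded into lr).
        errs = self._errors(x)
        for p_row, e in zip(self.predictor, errs):
            for i, v in enumerate(x):
                p_row[i] -= self.lr * e * v


def tuples_to_transfer(source, target, buffer, budget):
    """Hypothetical filter: keep the tuples on which the target agent is
    most uncertain relative to the source agent."""
    def gap(x):
        return target.uncertainty(x) - source.uncertainty(x)
    return sorted(buffer, key=gap, reverse=True)[:budget]


# Usage: the source agent has visited `seen` many times, the target has not.
seen = SarsRND.encode([0.1, -0.2], 1, 1.0, [0.15, -0.1])
unseen = SarsRND.encode([0.9, 0.4], 0, 0.0, [0.8, 0.5])
source = SarsRND(input_dim=6)
target = SarsRND(input_dim=6)

before = source.uncertainty(seen)
for _ in range(100):
    source.update(seen)
after = source.uncertainty(seen)

chosen = tuples_to_transfer(source, target, [seen, unseen], budget=1)
```

In this toy setting the source agent's uncertainty on the tuple it has repeatedly visited shrinks, so that tuple shows the largest gap between the two agents and is the one selected for transfer. How the source agent itself is chosen from performance and uncertainty is left out here, since the abstract gives no formula for it.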
