Sharing Lifelong Reinforcement Learning Knowledge via Modulating Masks

arXiv preprint. 25 pages, 14 figures, 9 tables. To be published in the Second Conference on Lifelong Learning Agents (CoLLAs 2023). Code is available at https://github.com/DMIU-ShELL/deeprl-shell

Lifelong learning agents aim to learn multiple tasks sequentially over a lifetime. This requires exploiting previous knowledge when learning new tasks while avoiding forgetting. Modulating masks, a specific type of parameter isolation approach, have recently shown promise in both supervised and reinforcement learning. Lifelong learning algorithms have been investigated mainly in the single-agent setting, and it remains an open question how multiple agents can share lifelong learning knowledge with each other. We show that the parameter isolation mechanism used by modulating masks is particularly suitable for exchanging knowledge among agents in a distributed and decentralized system of lifelong learners. The key idea is that isolating the knowledge of each task in its own mask allows agents to transfer only the specific knowledge they need, on demand, resulting in robust and effective distributed lifelong learning. We assume fully distributed and asynchronous scenarios in which the number of agents and their connectivity change dynamically. With an on-demand communication protocol, an agent facing a task queries its peers for the relevant masks, which are then transferred and integrated into its policy. Experiments indicate that on-demand mask communication is an effective way to implement distributed lifelong reinforcement learning and provides a lifelong learning benefit with respect to distributed RL baselines such as DD-PPO, IMPALA, and PPO+EWC. The system is particularly robust to connection drops and demonstrates rapid learning due to knowledge exchange.
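The on-demand exchange described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that all agents share the same fixed backbone weights, that a mask is a real-valued score array binarized by thresholding at zero, and that a peer replies with its mask together with a scalar reward. The names `Agent`, `OfflinePeer`, `handle_query`, `query_peers`, and `effective_weights` are hypothetical.

```python
import numpy as np


class Agent:
    """Hypothetical lifelong learner: a fixed backbone plus one mask per task."""

    def __init__(self, backbone):
        self.backbone = backbone  # assumed identical across agents, never overwritten
        self.masks = {}           # task_id -> (mask scores, reward achieved with them)
        self.peers = []           # peers this agent can currently try to reach

    def learn(self, task_id, scores, reward):
        # Stand-in for RL training: store the mask scores learned for this task.
        self.masks[task_id] = (scores, reward)

    def handle_query(self, task_id):
        # Reply with the mask for the requested task only, or None if unknown.
        return self.masks.get(task_id)

    def query_peers(self, task_id):
        # Ask every peer for this task's mask and keep the best-performing reply.
        best = self.masks.get(task_id)
        for peer in self.peers:
            try:
                reply = peer.handle_query(task_id)
            except ConnectionError:
                continue  # a dropped connection just means one fewer reply
            if reply is not None and (best is None or reply[1] > best[1]):
                best = reply
        if best is not None:
            self.masks[task_id] = best
        return best

    def effective_weights(self, task_id):
        # Binarize the scores (assumed threshold: 0) and modulate the backbone.
        scores, _ = self.masks[task_id]
        return self.backbone * (scores > 0)


class OfflinePeer:
    """Hypothetical peer whose connection has dropped."""

    def handle_query(self, task_id):
        raise ConnectionError("peer unreachable")
```

Because each task's knowledge lives in its own mask, a reply concerns only the task that was asked about, and integrating it cannot overwrite what the receiving agent knows about other tasks. In this sketch an unreachable peer is simply skipped.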
