Although deep reinforcement learning has achieved notable empirical success, it tends to learn relatively slowly because reward information propagates slowly and parametric neural networks are updated slowly. Non-parametric episodic memory, by contrast, offers a faster learning alternative: it does not require representation learning and uses the maximum episodic return as the state-action value for action selection. Episodic memory and reinforcement learning each have their own strengths and weaknesses. Notably, humans can leverage multiple memory systems concurrently during learning and benefit from all of them. In this work, we propose the Two-Memory reinforcement learning agent (2M), which combines episodic memory and reinforcement learning to draw on the strengths of both. The 2M agent exploits the speed of the episodic memory component together with the optimality and generalization capacity of the reinforcement learning component, so that the two complement each other. Our experiments demonstrate that the 2M agent is more data-efficient than, and outperforms, both pure episodic memory and pure reinforcement learning, as well as a state-of-the-art memory-augmented RL agent. Moreover, the proposed approach provides a general framework that can be used to combine any episodic memory agent with other off-policy reinforcement learning algorithms.
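To make the episodic-memory idea in the abstract concrete, the following is a minimal sketch of a memory that stores the maximum episodic return as the state-action value and acts greedily on it. It is an illustration under stated assumptions, not the 2M implementation: the class name `EpisodicMemory`, its methods, the tabular (hashable-state) storage, the discount factor `gamma`, and the default value for unseen pairs are all hypothetical choices made here, and the abstract does not specify how the two memories are combined.

```python
class EpisodicMemory:
    """Tabular sketch: keeps the best discounted return seen for each (state, action).

    Assumption: states are hashable, so a plain dict can serve as the memory.
    """

    def __init__(self, n_actions, gamma=0.99):
        self.n_actions = n_actions
        self.gamma = gamma
        self.table = {}  # (state, action) -> maximum return observed so far

    def update(self, episode):
        """Write one finished episode, given as [(state, action, reward), ...]."""
        g = 0.0
        # Walk backwards so g is the discounted return from each step onward.
        for state, action, reward in reversed(episode):
            g = reward + self.gamma * g
            key = (state, action)
            # Keep the maximum return rather than averaging.
            self.table[key] = max(self.table.get(key, float("-inf")), g)

    def value(self, state, action, default=0.0):
        """Stored value, or `default` (an assumption) for unseen pairs."""
        return self.table.get((state, action), default)

    def act(self, state):
        """Greedy action with respect to the stored maximum returns."""
        values = [self.value(state, a) for a in range(self.n_actions)]
        return max(range(self.n_actions), key=values.__getitem__)


memory = EpisodicMemory(n_actions=2, gamma=1.0)
memory.update([("s0", 0, 0.0), ("s1", 1, 1.0)])
memory.update([("s0", 1, 0.0), ("s1", 0, 3.0)])
memory.update([("s0", 0, 0.0), ("s1", 1, 0.5)])  # lower return, does not overwrite
best_action_s0 = memory.act("s0")
```

Because a single good episode immediately raises the stored values along its whole trajectory, such a memory can exploit a rewarding path after one visit, which is the speed advantage the abstract refers to. The same max operator can overestimate values in stochastic environments, which is one reason to pair it with a parametric reinforcement learning agent.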