While deep reinforcement learning has achieved notable empirical success, it tends to learn relatively slowly because reward information propagates slowly and parametric neural networks are updated slowly. Non-parametric episodic memory, on the other hand, provides a faster learning alternative that does not require representation learning and uses the maximum episodic return as the state-action value for action selection. Episodic memory and reinforcement learning each have their own strengths and weaknesses. Notably, humans can leverage multiple memory systems concurrently during learning and benefit from all of them. In this work, we propose a method called the Two-Memory reinforcement learning agent (2M), which combines episodic memory and reinforcement learning to draw on the strengths of both. The 2M agent exploits the speed of the episodic memory part and the optimality and generalization capacity of the reinforcement learning part, so that the two complement each other. Our experiments demonstrate that the 2M agent is more data efficient and outperforms both pure episodic memory and pure reinforcement learning, as well as a state-of-the-art memory-augmented RL agent. Moreover, the proposed approach provides a general framework that can be used to combine any episodic memory agent with other off-policy reinforcement learning algorithms.
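To make the episodic memory idea concrete, the following is a minimal sketch of a tabular memory that keeps, for each state-action pair, the maximum discounted return observed in any episode and acts greedily on those values. It is an illustration under stated assumptions, not the implementation used in this work: the class name `EpisodicMemory`, the discount `gamma`, the hashable-state assumption, and the dictionary layout are all hypothetical. How the 2M agent switches between this memory and the reinforcement learning part is not specified in the abstract, so it is not sketched here.

```python
from collections import defaultdict


class EpisodicMemory:
    """Hypothetical tabular episodic memory (illustrative only).

    Stores, for each (state, action) pair, the largest discounted return
    observed from that pair in any recorded episode.
    Assumes states and actions are hashable.
    """

    def __init__(self, gamma=0.99):
        self.gamma = gamma
        # state -> {action: maximum discounted return seen so far}
        self.table = defaultdict(dict)

    def update(self, episode):
        """Record one finished episode.

        `episode` is a list of (state, action, reward) tuples in time order.
        """
        ret = 0.0
        # Walk backwards so the return from each step is built incrementally.
        for state, action, reward in reversed(episode):
            ret = reward + self.gamma * ret
            best = self.table[state].get(action, float("-inf"))
            self.table[state][action] = max(best, ret)

    def value(self, state, action):
        """Return the stored value, or None if the pair was never visited."""
        return self.table.get(state, {}).get(action)

    def act(self, state):
        """Return the action with the highest stored return, or None if the
        state has never been visited."""
        known = self.table.get(state, {})
        if not known:
            return None
        return max(known, key=known.get)
```

Because values are overwritten only when a better return is observed, a single good episode is usable for action selection immediately, which is the source of the speed advantage described above. The same max operator never lowers a stored value, which is one reason such a memory can settle on suboptimal behavior in stochastic environments.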