Advances in reinforcement learning (RL) often rely on massive compute resources and remain notoriously sample-inefficient. In contrast, the human brain learns effective control strategies efficiently with limited resources. This raises the question of whether insights from neuroscience can be used to improve current RL methods. Predictive processing is a popular theoretical framework that maintains that the human brain actively seeks to minimize surprise. We show that recurrent neural networks that predict their own sensory states can be leveraged to minimize surprise, yielding substantial gains in cumulative reward. Specifically, we present the Predictive Processing Proximal Policy Optimization (P4O) agent, an actor-critic reinforcement learning agent that applies predictive processing to a recurrent variant of the PPO algorithm by integrating a world model in its hidden state. Even without hyperparameter tuning, P4O significantly outperforms a baseline recurrent variant of the PPO algorithm on multiple Atari games using a single GPU. It also outperforms other state-of-the-art agents given the same wall-clock time and exceeds human gamer performance on multiple games, including Seaquest, a particularly challenging environment in the Atari domain. Altogether, our work underscores how insights from neuroscience may support the development of more capable and efficient artificial agents.
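The abstract does not spell out the training objective, so the following is only a minimal sketch of one way a surprise term could be attached to a PPO-style loss. It is not the P4O objective. The standard PPO clipped surrogate is used as-is; measuring surprise as the mean squared error between predicted and observed sensory states, the weighting parameter `surprise_weight`, and all function names are assumptions made for illustration.

```python
import numpy as np


def ppo_clip_loss(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate, negated so that lower is better."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))


def surprise_loss(predicted_obs, actual_obs):
    """Assumption: surprise is approximated by mean squared prediction error."""
    predicted_obs = np.asarray(predicted_obs, dtype=float)
    actual_obs = np.asarray(actual_obs, dtype=float)
    return np.mean((predicted_obs - actual_obs) ** 2)


def combined_loss(ratio, advantage, predicted_obs, actual_obs,
                  surprise_weight=1.0, clip_eps=0.2):
    """Hypothetical combination: policy loss plus a weighted surprise penalty."""
    policy_term = ppo_clip_loss(ratio, advantage, clip_eps)
    surprise_term = surprise_loss(predicted_obs, actual_obs)
    return policy_term + surprise_weight * surprise_term
```

In a full agent, `predicted_obs` would presumably come from the recurrent network's prediction of its next sensory state, and the value and entropy terms of PPO would be added as usual; both are omitted here to keep the sketch short.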