The ability to learn continually is essential in a complex and changing world. In this paper, we characterize the behavior of canonical value-based deep reinforcement learning (RL) approaches under varying degrees of non-stationarity. In particular, we demonstrate that deep RL agents lose their ability to learn good policies when they cycle through a sequence of Atari 2600 games. This phenomenon has been alluded to in prior work under various guises -- e.g., loss of plasticity, implicit under-parameterization, primacy bias, and capacity loss. We investigate it closely at scale and analyze how the weights, gradients, and activations change over time across several experiments that vary along multiple dimensions (e.g., similarity between games, number of games, number of frames per game), with some experiments spanning 50 days and 2 billion environment interactions. Our analysis shows that the activation footprint of the network becomes sparser, which contributes to diminishing gradients. We investigate a remarkably simple mitigation strategy -- the Concatenated ReLU (CReLU) activation function -- and demonstrate its effectiveness in facilitating continual learning in a changing environment.
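To make the mitigation concrete, the following is a minimal sketch of a CReLU activation next to a standard ReLU, written in plain Python rather than a deep learning framework. It is not taken from the paper's code: the `active_fraction` helper and the example pre-activation values are assumptions chosen for illustration, and the fraction of nonzero units is only a crude stand-in for the "activation footprint" discussed above.

```python
from typing import List


def relu(x: List[float]) -> List[float]:
    """Standard ReLU: negative pre-activations are zeroed out."""
    return [max(v, 0.0) for v in x]


def crelu(x: List[float]) -> List[float]:
    """Concatenated ReLU: [ReLU(x), ReLU(-x)], which doubles the feature width."""
    return relu(x) + relu([-v for v in x])


def active_fraction(activations: List[float]) -> float:
    """Fraction of units with a nonzero output (hypothetical sparsity measure)."""
    return sum(1 for a in activations if a != 0.0) / len(activations)


# Hypothetical pre-activations for one layer; most of them are negative.
pre_activations = [-1.5, -0.2, 0.7, -3.0]

relu_out = relu(pre_activations)
crelu_out = crelu(pre_activations)

print("ReLU  active fraction:", active_fraction(relu_out))
print("CReLU active fraction:", active_fraction(crelu_out))
```

In this sketch, every nonzero pre-activation produces a nonzero output in exactly one of the two CReLU halves, so a unit that a plain ReLU would silence still passes a signal (and hence a gradient) through its negated copy. The cost is that the layer's output is twice as wide, so the following layer needs correspondingly more input weights.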