We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples, by combating value function divergence. Under large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we investigate the phenomena leading to the primacy bias. We inspect the early stages of training that were conjectured to cause the failure to learn and find that one fundamental challenge is a long-standing acquaintance: value function divergence. Overinflated Q-values appear not only on out-of-distribution data but also on in-distribution data and can be linked to overestimation of values for unseen actions, propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in part, the prior explanation for sub-optimal learning due to overfitting early data.
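The unit-ball normalization referred to above constrains the features feeding the value head, which bounds the magnitude of predicted Q-values and thereby limits divergence under many consecutive gradient steps. The sketch below is a minimal, illustrative PyTorch rendering of this idea, not the authors' exact implementation: the architecture, layer sizes, and the specific projection used (dividing the penultimate features by max(1, their L2 norm)) are assumptions made for the example.

```python
import torch
import torch.nn as nn


class UnitBallQNetwork(nn.Module):
    """Illustrative Q-network whose penultimate features are projected into
    the unit ball before the final linear layer (sketch only; sizes and
    projection choice are assumptions, not taken from the paper)."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        features = self.encoder(torch.cat([obs, action], dim=-1))
        # Unit-ball projection: rescale only if the feature norm exceeds 1,
        # so the head sees inputs of bounded magnitude. With fixed head
        # weights this caps how large the predicted Q-value can grow.
        norm = features.norm(dim=-1, keepdim=True).clamp(min=1.0)
        features = features / norm
        return self.head(features)


# Hypothetical usage with dm_control-like observation/action sizes.
q_net = UnitBallQNetwork(obs_dim=24, action_dim=6)
obs = torch.randn(32, 24)
action = torch.randn(32, 6)
q_values = q_net(obs, action)  # shape: (32, 1)
```

The intuition, under the assumptions above, is that bounding the feature norm keeps bootstrapped targets from running away even when the same replayed transitions are revisited many times per environment step.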