We focus on the task of approximating the optimal value function in deep reinforcement learning. This iterative process consists of solving a sequence of optimization problems in which the loss function changes from one iteration to the next. The common approach to solving this sequence of problems is to employ modern variants of stochastic gradient descent, such as Adam. These optimizers maintain their own internal parameters, such as estimates of the first-order and second-order moments of the gradient, and update them over time. As a result, information obtained in previous iterations is carried over and used to solve the optimization problem in the current iteration. We demonstrate that this carried-over information can contaminate the moment estimates, because the optimization landscape can change arbitrarily from one iteration to the next. To hedge against this negative effect, a simple idea is to reset the internal parameters of the optimizer when starting a new iteration. We empirically investigate this resetting idea by employing various optimizers in conjunction with the Rainbow algorithm. We demonstrate that this simple modification significantly improves the performance of deep RL on the Atari benchmark.
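To make the resetting idea concrete, the following is a minimal sketch rather than the implementation used in the experiments. It assumes a hand-written Adam (the class `Adam` and the helper `run` are hypothetical names introduced here) and a toy sequence of quadratic losses, where each new target stands in for a changed loss function. The only point illustrated is that resetting discards the moment estimates and the step counter at the start of each iteration.

```python
import math


class Adam:
    """Minimal Adam over a list of floats, with its internal state made explicit."""

    def __init__(self, n_params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.n_params = n_params
        self.reset()

    def reset(self):
        # Discard the first- and second-moment estimates and the step counter.
        self.m = [0.0] * self.n_params
        self.v = [0.0] * self.n_params
        self.t = 0

    def step(self, params, grads):
        # One Adam update with bias-corrected moment estimates.
        self.t += 1
        new_params = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            new_params.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return new_params


def run(targets, steps_per_iteration, reset_each_iteration):
    """Minimize (w - target)^2 for each target in turn (a toy stand-in, an assumption)."""
    opt = Adam(n_params=1, lr=0.1)
    w = [0.0]
    for target in targets:  # each target plays the role of a new loss function
        if reset_each_iteration:
            opt.reset()  # start the new iteration with fresh optimizer state
        for _ in range(steps_per_iteration):
            grad = [2.0 * (w[0] - target)]
            w = opt.step(w, grad)
    return w[0], opt.t


if __name__ == "__main__":
    for flag in (False, True):
        final_w, step_count = run([1.0, -1.0, 2.0], 50, reset_each_iteration=flag)
        print(f"reset={flag}: final w={final_w:.4f}, optimizer step counter={step_count}")
```

Without the reset, the moment estimates accumulated on one target are still in place when the next target arrives. With the reset, each iteration begins from zeroed moments and a step counter of zero, so the bias correction applies afresh.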