We focus on the task of approximating the optimal value function in deep reinforcement learning (RL). This iterative process consists of approximately solving a sequence of optimization problems in which the objective function can change from one iteration to the next. The common approach is to employ modern variants of stochastic gradient descent, such as Adam. These optimizers maintain their own internal parameters, such as estimates of the first and second moments of the gradient, and update them over time. As a result, information obtained in previous iterations is used to solve the optimization problem in the current iteration. We hypothesize that this can contaminate the optimizer's internal parameters when the optimization landscape of previous iterations differs substantially from that of the current iteration. To hedge against this effect, a simple idea is to reset the internal parameters of the optimizer when starting a new iteration. We empirically investigate this resetting strategy by employing various optimizers in conjunction with the Rainbow algorithm. We demonstrate that this simple modification unleashes the true potential of modern optimizers and significantly improves the performance of deep RL on the Atari benchmark.
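The resetting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Adam` class, the `train` helper, and the toy objective `(w - target)^2` are all assumptions made for illustration, with a changing scalar target standing in for the objective that changes between iterations. The only point shown is where the reset happens: the moment estimates and the step counter are cleared at the start of each new iteration.

```python
import math


class Adam:
    """Minimal Adam over a flat list of floats (hypothetical, illustrative only)."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.reset(0)

    def reset(self, n):
        # Discard first/second moment estimates and the step counter.
        self.m = [0.0] * n
        self.v = [0.0] * n
        self.t = 0

    def step(self, params, grads):
        # Lazily size the state on first use.
        if len(self.m) != len(params):
            self.reset(len(params))
        self.t += 1
        new_params = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            # Bias-corrected moment estimates.
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            new_params.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return new_params


def train(targets, steps_per_iteration, reset_each_iteration):
    """Approximately solve min_w (w - target)^2 for each target in sequence."""
    opt = Adam(lr=0.1)
    w = [0.0]
    for target in targets:
        if reset_each_iteration:
            # New iteration, new objective: start from fresh optimizer state.
            opt.reset(len(w))
        for _ in range(steps_per_iteration):
            grad = [2.0 * (w[0] - target)]
            w = opt.step(w, grad)
    return w, opt
```

Calling `train([1.0, -1.0, 2.0], 50, reset_each_iteration=True)` and again with `reset_each_iteration=False` lets the two variants be compared on the same sequence of objectives. In the second case the moment estimates accumulated on earlier targets carry over into later ones, which is the effect the abstract hypothesizes can be harmful.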