We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins' Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.
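To make the estimation bias concrete, the following minimal sketch simulates the standard estimator of the maximum expected value, namely the maximum of per-action sample means, which is the form of the target used in Watkins' Q-learning. It is an illustration only and does not implement the 2RA estimator. The toy setup (five actions with identical true value 0, unit-variance Gaussian noise, 10 samples per action) and the helper names `max_of_sample_means` and `average_estimate` are assumptions made for this example.

```python
import random
import statistics


def max_of_sample_means(true_means, n_samples, rng):
    """Estimate max_a E[X_a] by the maximum of the per-action sample means."""
    sample_means = [
        statistics.fmean(rng.gauss(mu, 1.0) for _ in range(n_samples))
        for mu in true_means
    ]
    return max(sample_means)


def average_estimate(true_means, n_samples, n_trials, seed=0):
    """Average the max-of-means estimate over many independent trials."""
    rng = random.Random(seed)
    return statistics.fmean(
        max_of_sample_means(true_means, n_samples, rng) for _ in range(n_trials)
    )


# Hypothetical toy problem: five actions whose true expected value is 0,
# so the true maximum expected value is also 0.
true_means = [0.0] * 5
estimate = average_estimate(true_means, n_samples=10, n_trials=2000)
bias = estimate - max(true_means)
print(f"average max-of-means estimate: {estimate:.3f}, bias: {bias:.3f}")
```

Although every action has true value 0, the averaged estimate comes out clearly positive, because taking the maximum over noisy sample means systematically selects upward noise. This is the overestimation bias that the abstract refers to.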