This paper proposes a new Q-learning algorithm with a dummy adversarial player, called dummy adversarial Q-learning (DAQ), that can effectively regulate the overestimation bias in standard Q-learning. With the dummy player, learning can be formulated as a two-player zero-sum game. DAQ unifies several Q-learning variants that control overestimation bias, such as maxmin Q-learning and minmax Q-learning (the latter proposed in this paper), in a single framework. DAQ is a simple but effective way to suppress the overestimation bias through dummy adversarial behaviors, and it can be easily applied to off-the-shelf reinforcement learning algorithms to improve their performance. The finite-time convergence of DAQ is analyzed from an integrated perspective by adapting the analysis of adversarial Q-learning. The performance of DAQ is demonstrated empirically on various benchmark environments.
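To make the two named variants concrete, the following is a minimal sketch of the bootstrapped targets they are commonly understood to use, written for tabular Q-functions. It is an illustration under stated assumptions, not the paper's implementation: the function names, the list-of-lists table layout, and the toy values are hypothetical, and the minmax form (minimum over estimators of the per-estimator greedy value) is inferred from its name, since the abstract does not spell out the update rule. The dummy-player construction itself is not modeled here.

```python
from typing import List, Sequence

# Hypothetical layout: q[state][action] holds one tabular Q-estimate.
QTable = List[List[float]]


def standard_target(q: QTable, reward: float, next_state: int, gamma: float) -> float:
    """Standard Q-learning target: r + gamma * max_a Q(s', a)."""
    return reward + gamma * max(q[next_state])


def maxmin_target(q_tables: Sequence[QTable], reward: float,
                  next_state: int, gamma: float) -> float:
    """Maxmin-style target: r + gamma * max_a min_i Q_i(s', a)."""
    n_actions = len(q_tables[0][next_state])
    # For each action take the most pessimistic estimate, then act greedily.
    return reward + gamma * max(
        min(q[next_state][a] for q in q_tables) for a in range(n_actions)
    )


def minmax_target(q_tables: Sequence[QTable], reward: float,
                  next_state: int, gamma: float) -> float:
    """Minmax-style target (assumed form): r + gamma * min_i max_a Q_i(s', a)."""
    # Each estimator acts greedily, then the most pessimistic value is kept.
    return reward + gamma * min(max(q[next_state]) for q in q_tables)


if __name__ == "__main__":
    # Toy example: one state, two actions, two estimators (values are made up).
    q1: QTable = [[1.0, 3.0]]
    q2: QTable = [[2.0, 0.0]]
    tables = [q1, q2]
    print(maxmin_target(tables, 0.0, 0, 0.5))   # 0.5
    print(minmax_target(tables, 0.0, 0, 0.5))   # 1.0
    print(standard_target(q1, 0.0, 0, 0.5))     # 1.5
```

Because a max of mins never exceeds a min of maxes, the maxmin target in this sketch is never larger than the minmax target, and both are at most the standard target computed from any single estimator. This ordering is what makes either form a candidate for damping overestimation.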