This paper proposes new end-to-end deep reinforcement learning algorithms for two-player zero-sum Markov games. Unlike prior work that trains agents to beat a fixed set of opponents, our objective is to find Nash equilibrium policies, which cannot be exploited even by adversarial opponents. We propose (a) the Nash-DQN algorithm, which integrates deep learning techniques from single-agent DQN into the classic Nash Q-learning algorithm for solving tabular Markov games; and (b) the Nash-DQN-Exploiter algorithm, which additionally adopts an exploiter to guide the exploration of the main agent. We evaluate both algorithms on tabular examples and on various two-player Atari games. Our empirical results demonstrate that (i) the policies found by many existing methods, including Neural Fictitious Self-Play and Policy Space Response Oracles, can be prone to exploitation by adversarial opponents; and (ii) the output policies of our algorithms are robust to exploitation and thus outperform existing methods.
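To make the link between DQN and Nash Q-learning concrete, the following is a minimal sketch of a Nash-style temporal-difference target, written under stated assumptions rather than as the paper's implementation. It assumes that the single-agent DQN target r + γ max_a Q(s', a) is replaced by r + γ V(s'), where V(s') is the Nash value of the zero-sum matrix game given by the next-state joint-action Q-values, as in classic Nash Q-learning. The function names `nash_value` and `nash_td_target` are hypothetical, and the matrix-game solver shown here (Hedge self-play, pure Python) is a stand-in chosen for brevity, not necessarily the solver used by the authors.

```python
import math


def _softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nash_value(payoff, iters=5000):
    """Approximate the value of a zero-sum matrix game (hypothetical helper).

    payoff[i][j] is the row player's payoff; the row player maximizes and
    the column player minimizes. Both players run Hedge (multiplicative
    weights); their time-averaged strategies approach a Nash equilibrium.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    flat = [v for row in payoff for v in row]
    spread = (max(flat) - min(flat)) or 1.0
    # Standard Hedge step size, scaled by the payoff range.
    eta = math.sqrt(8.0 * math.log(max(n_rows, n_cols, 2)) / iters) / spread

    row_gain = [0.0] * n_rows  # cumulative expected payoff of each row action
    col_loss = [0.0] * n_cols  # cumulative expected payoff conceded per column
    x_avg = [0.0] * n_rows
    y_avg = [0.0] * n_cols

    for _ in range(iters):
        x = _softmax([eta * g for g in row_gain])
        y = _softmax([-eta * c for c in col_loss])
        for i in range(n_rows):
            x_avg[i] += x[i] / iters
            row_gain[i] += sum(payoff[i][j] * y[j] for j in range(n_cols))
        for j in range(n_cols):
            y_avg[j] += y[j] / iters
            col_loss[j] += sum(payoff[i][j] * x[i] for i in range(n_rows))

    # The true value lies between what x_avg guarantees and what y_avg concedes.
    lower = min(sum(payoff[i][j] * x_avg[i] for i in range(n_rows))
                for j in range(n_cols))
    upper = max(sum(payoff[i][j] * y_avg[j] for j in range(n_cols))
                for i in range(n_rows))
    return 0.5 * (lower + upper)


def nash_td_target(reward, next_q, gamma=0.99, done=False):
    """TD target with the max operator replaced by the Nash value (assumed form)."""
    if done:
        return reward
    return reward + gamma * nash_value(next_q)


if __name__ == "__main__":
    # Matching pennies: no pure-strategy equilibrium exists.
    print(nash_value([[1.0, -1.0], [-1.0, 1.0]]))
    # A game with a saddle point at (row 2, column 2).
    print(nash_td_target(1.0, [[3.0, 1.0], [4.0, 2.0]], gamma=0.5))
```

In a deep variant, `next_q` would come from a target network that outputs one Q-value per joint action; an exact linear-programming solver could replace the iterative stand-in when higher precision is needed.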