AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero's search requires accurate value estimates for the states appearing in its search tree. AlphaZero trains on self-play matches that begin from the initial state of a game and samples actions only over the first few moves, which limits its exploration of states deeper in the game tree. We introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from varied starting states enables Go-Exploit to explore the game tree more effectively and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train on more independent value targets, improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies. In the games of Connect Four and 9x9 Go, we show that Go-Exploit learns with greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. We also compare Go-Exploit to KataGo, a more sample-efficient reimplementation of AlphaZero, and demonstrate that Go-Exploit has a more effective search control strategy. Furthermore, Go-Exploit's sample efficiency improves when KataGo's other innovations are incorporated.
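The search control idea described above, sampling the start state of each self-play trajectory from an archive of states of interest, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration rather than a detail taken from the abstract: the class `StartStateArchive`, the bounded FIFO buffer, uniform sampling, the `p_initial` probability of falling back to the initial state, the rule that every visited non-terminal state is archived, and the toy counting game. A random move chooser stands in for AlphaZero's MCTS-guided policy.

```python
import random
from collections import deque


class StartStateArchive:
    """Hypothetical archive of states of interest (bounded FIFO buffer)."""

    def __init__(self, initial_state, capacity=1000, p_initial=0.2, rng=None):
        self.initial_state = initial_state
        self.p_initial = p_initial            # assumed chance of a standard start
        self.states = deque(maxlen=capacity)  # oldest states are evicted first
        self.rng = rng or random.Random()

    def add(self, state):
        self.states.append(state)

    def sample_start(self):
        # Fall back to the game's initial state when the archive is empty,
        # or with probability p_initial; otherwise sample uniformly.
        if not self.states or self.rng.random() < self.p_initial:
            return self.initial_state
        return self.states[self.rng.randrange(len(self.states))]


def self_play_episode(archive, legal_actions, apply_action, is_terminal, choose_action):
    """Play one trajectory from a sampled start state, then archive its states."""
    state = archive.sample_start()
    trajectory = [state]
    while not is_terminal(state):
        action = choose_action(state, legal_actions(state))
        state = apply_action(state, action)
        trajectory.append(state)
    # Assumption: every visited non-terminal state counts as "of interest".
    for visited in trajectory:
        if not is_terminal(visited):
            archive.add(visited)
    return trajectory


# Toy single-counter game used only to exercise the sketch:
# the state is an integer, each move adds 1 or 2, and the game ends at 10 or more.
def toy_legal_actions(state):
    return [1, 2]


def toy_apply_action(state, action):
    return state + action


def toy_is_terminal(state):
    return state >= 10


def make_random_chooser(rng):
    # Placeholder for a search-guided policy; picks a legal action at random.
    return lambda state, actions: rng.choice(actions)
```

Because later trajectories may begin from a state deep in the game, they tend to be shorter than trajectories played from the initial state, which is the property the abstract links to obtaining more independent value targets.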