Targeted Search Control in AlphaZero for Effective Policy Improvement

Source: arXiv. This paper has been accepted to the Proceedings of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023).

AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero's search requires accurate value estimates for the states appearing in its search tree. AlphaZero trains upon self-play matches beginning from the initial state of a game and only samples actions over the first few moves, limiting its exploration of states deeper in the game tree. We introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from varied starting states enables Go-Exploit to more effectively explore the game tree and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train upon more independent value targets, improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies. In the games of Connect Four and 9x9 Go, we show that Go-Exploit learns with a greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. We also compare Go-Exploit to KataGo, a more sample efficient reimplementation of AlphaZero, and demonstrate that Go-Exploit has a more effective search control strategy. Furthermore, Go-Exploit's sample efficiency improves when KataGo's other innovations are incorporated.
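The search control idea in the abstract, starting self-play from states drawn from an archive rather than always from the initial position, can be illustrated with a minimal sketch. The abstract does not say how the archive is built, how large it is, or how start states are sampled, so everything below is an assumption made for illustration: the names `StartStateArchive` and `self_play`, the bounded first-in-first-out eviction, the uniform sampling, and the `p_initial` probability of falling back to the initial state are all hypothetical and are not taken from the paper.

```python
import random
from collections import deque


class StartStateArchive:
    """Hypothetical bounded archive of candidate start states for self-play."""

    def __init__(self, initial_state, capacity=1000, rng=None):
        self.initial_state = initial_state
        # Assumption: a fixed-size FIFO buffer; the oldest states are evicted first.
        self.states = deque(maxlen=capacity)
        self.rng = rng or random.Random()

    def add(self, state):
        self.states.append(state)

    def sample_start(self, p_initial=0.2):
        # Assumption: with probability p_initial (or when the archive is empty)
        # start from the game's initial state; otherwise sample uniformly.
        if not self.states or self.rng.random() < p_initial:
            return self.initial_state
        return self.rng.choice(self.states)


def self_play(archive, choose_action, is_terminal, step):
    """Play one trajectory from a sampled start state and archive visited states."""
    state = archive.sample_start()
    trajectory = [state]
    while not is_terminal(state):
        state = step(state, choose_action(state))
        trajectory.append(state)
    # Non-terminal states seen in this game become candidate start states later.
    for visited in trajectory:
        if not is_terminal(visited):
            archive.add(visited)
    return trajectory
```

In a real training loop, `choose_action` would be the search-guided policy and `step` the game's transition function. Trajectories that begin from archived states are shorter than full games, which is the property the abstract links to training on more independent value targets.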
